The material for these sessions is written in R Markdown. R Markdown is a plain text format which can be processed to produce nicely formatted output such as HTML. The point of these sessions is not to discuss R Markdown in any real detail, but to learn R, so we won’t dwell on it here. That said, you will likely get somewhat familiar with the syntax simply by working through this document.
There are a number of different output types available when authoring with R Markdown. Here, we are using R Notebooks, or the html_notebook output style. R Notebooks are useful because R code chunks can be inserted by users at any point, and the code in those chunks can be executed independently and output shown.
Try it out. Type CTRL+ALT+I to insert a new chunk wherever the cursor is positioned, or press the ‘insert’ icon at the top of this pane. Enter one or more R commands and click the ‘play’ icon (the small green triangle at the top-right of the code chunk). You can also edit the code in a chunk and re-evaluate it on demand.
Of course, you can also enter commands at any time in the traditional console.
After today you will:
We have a fair bit to cover, so coverage will be intentionally light at times. Readers are encouraged to learn more about things independently as relevant.
RStudio is an integrated development environment (IDE) for writing and executing R code. It is an open source product developed by RStudio Inc.
We will not cover many of the features in any real detail–there are simply too many to talk about here, and the easiest way to learn about RStudio is probably just to start using it. RStudio Inc. also adds new features regularly, so users are encouraged to keep an eye on the RStudio blog.
While there are a number of ways of writing and executing code, using RStudio is the recommended approach. At Statistics New Zealand, RStudio runs on a Linux server, and is accessed in a web browser by visiting either of:
or:
RStudio is available to all, though a help desk request is still required. This is so the service centre can keep track of the number of users, and also send an FAQ to new users.
RStudio works better in modern browsers. For that reason, it is recommended that you use Google Chrome with RStudio, rather than Internet Explorer.
RStudio IDE
Pane 1, the ‘source’ pane, consists of a tabbed layout containing documents of various types such as R scripts, C source code files, markdown documents, and so on. This pane is only present if such documents are open. If no documents are open, pane 2 will expand to fill the space.
Pane 2, the ‘console’ pane, consists of the traditional interactive R console. In this window, the user can enter single R commands (or code blocks) and execute them. A command history is generally available for the user to cycle through using the ↑ / ↓ keys.
Panes 3 and 4 are customisable. By default, pane 3 contains ‘Environment’ and ‘History’ tabs; while pane 4 contains ‘Files’, ‘Plots’, ‘Packages’, ‘Help’, and ‘Viewer’ tabs.
The general layout of the panes is customisable, and users can collapse and zoom panes in various ways. Settings can be changed by selecting Tools→Global Options…, then Pane Layout. Alternatively, click the Workspace Panes button.
RStudio has a number of customisable features, and users should feel free to experiment. Most are found via Tools→Global Options… .
global options
Otherwise, there are a large number of useful keyboard shortcuts. Probably the single most useful combination is CTRL + ENTER–when focus is on a source window this results in the current line (or selection of lines) being executed in the console. Pressing SHIFT + ALT + K brings up a list of keyboard shortcuts:
keyboard shortcuts
Traditionally, R users have launched the system help in a web browser by issuing help.start(). RStudio has a ‘Help’ tab visible by default, the contents of which contain the traditional help, among other things.
Users can browse and search the help, and are encouraged to do so–R help pages generally include runnable examples so are a good source for learning.
If a user wants the help page for a particular function, foo, then a shortcut is to issue ?foo from the command line. If foo belongs to a package which is not yet loaded, its help can be accessed via ?packagename::foo. Try it out:
help.start()
If the browser launched by 'xdg-open' is already running, it is
*not* restarted, and you must switch to its window.
Otherwise, be patient ...
?plot
?dplyr::summarise
More detailed documentation on RStudio than is presented here can be found by selecting Help→RStudio Docs from the file menu.
R is made extensible by way of add-on packages. To load a package called foo, simply run:
library(foo)
or
require(foo)
There are a large number of packages preinstalled and available to all. If a package you wish to use isn’t present, you will be able to install it in your own personal package library.
There are several sources for packages, but you will commonly get them from the official Comprehensive R Archive Network (CRAN). If you wish to install package foo from CRAN, simply run:
install.packages('foo')
Please do not install a package without checking it is available in the site library first. And if you want a package updated, please contact the service centre, rather than installing in your own personal library–it makes sense for everybody to be using the same version of packages where possible.
R is case sensitive, so users coming from languages such as SAS or Visual Basic need to take special care. N and n can happily coexist in R, for example, as there is no ambiguity.
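To make this concrete, here is a small sketch (the names are chosen arbitrarily): because R is case sensitive, N and n are entirely separate objects.

```r
# N and n are distinct names, so both can be defined at once
N <- 100
n <- 10
N / n   # 10
```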
Note also that the server implementation of R runs on a Linux platform, and Linux is sensitive to case in situations where Windows is not. This is particularly important for file and directory names. For example, blah.txt and BLAH.TXT are not the same file when stored on a Linux filesystem and, again, can happily coexist.
The R servers are Linux-based, not Windows. Linux and Windows are fundamentally different operating systems, and the differences will occasionally be visible to users.
Probably the most obvious difference is in the handling of file and directory names. In Linux, case sensitivity is important, as noted already.
In addition to this, directories are separated with forward slashes, /, on a Linux platform, while in Windows they are separated by backslashes, \. Actually, forward slashes can be used in R on any platform, and this is the recommended approach (the backslash is R’s escape character in strings, so backslashes in paths must be doubled; forward slashes are just easier).
Files and directories starting with a period are hidden and will not be visible by default, exceptions being hidden files that relate to R, such as .Renviron.
Finally, drive letters are specific to Windows–there is no such thing as the ‘C:’ drive, for example. On a Linux system, the root directory is /, and every other file or directory is located somewhere below this. That is, every file can be accessed in the following form:
/folder_1/folder_2/.../folder_n/filename
Several conveniences also exist: . is the current directory, .. is the parent directory (one level above), and ~ is a shortcut for the user’s home folder (the home folder on a Unix machine is something like ‘My Documents’ on Windows). So, for example, ./blah.txt refers to the file blah.txt located in the current directory, ../blah.txt refers to the file blah.txt in the parent directory, and ~/blah.txt refers to the file blah.txt in the user’s home directory. On the R servers, ~ is a shortcut to /home/STATSNZ/<username>.
dir('.')
[1] "data" "img"
[3] "raw" "README.html"
[5] "README.md" "rintro_solutions_cache"
[7] "rintro_solutions_files" "rintro_solutions.html"
[9] "rintro_solutions.nb.html" "rintro_solutions.Rmd"
[11] "rintro.html" "rintro.nb.html"
[13] "rintro.Rmd" "rintro.Rproj"
dir('..')
[1] "data" "img" "R" "rintro"
[5] "Rintro_cache" "Rintro_files" "Rintro.html" "Rintro.Rmd"
[9] "Rintro.Rproj" "styles.css" "test"
dir('~')
[1] "activator" "blah" "Desktop" "Documents"
[5] "Downloads" "Dropbox" "eclipse" "fontconfig"
[9] "galleries" "IdeaProjects" "key.pem" "mbox"
[13] "missfont.log" "mnt" "Music" "Network-Shares"
[17] "node_modules" "npm-debug.log" "PDF" "pgadmin.log"
[21] "Pictures" "Public" "qgis" "R"
[25] "shp" "Templates" "tmp" "usr"
[29] "Work" "workspace" "www"
normalizePath("~")
[1] "/home/cmhh"
One final point relevant to R users, and not related to the file system, is the use of system commands. R functions such as system are often used in code to issue ‘system’ commands.
On a Windows platform, this essentially amounts to running a DOS command; and on a Linux platform it is like issuing a command from any of the available shells (bash, sh, etc.).
System commands differ dramatically between Windows and Linux, so users who make use of such facilities should take care. For example system("dir") would print the contents of the current directory on a Windows machine, but system("ls") would be the rough equivalent on a Linux machine.
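One way to handle this, sketched below, is to branch on the operating system before issuing a command (the system call is commented out so the snippet has no side effects):

```r
# a sketch: choose a system command based on the operating system
os <- Sys.info()[["sysname"]]
cmd <- if (os == "Windows") "dir" else "ls"
# system(cmd)   # uncomment to actually list the current directory
```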
R can easily be made to typeset the word ‘Māori’ correctly. R supports Unicode, and ‘ā’ is code point U+0101. For example:
cat("M\u0101ori")
Māori
Of course, this applies to any Unicode symbol–\u means that what follows is a Unicode code point in hexadecimal. See Wikipedia, for example, for a list of characters (ā is under the heading ‘Latin Extended-A’).
R is a relatively big language, and gaining a comprehensive understanding will take some time and effort. However, like other statistical environments, a fair bit can be achieved with relatively little. So, we’ll start with a few examples–purely for illustrative purposes. This will allow us to show what R is capable of, without getting bogged down in low-level syntax. We will then ‘backfill’ the essentials in later sections.
R has many test datasets built-in, and these are very useful when learning or testing code. Some are quite famous, and you might already have some familiarity with them.
# list all built-in datasets in the 'datasets' package
data(package="datasets")
# show the built-in help for the iris dataset
?iris
# show the iris dataset in RStudio's built-in viewer
# View(iris) # uncomment for interactive use...
# default plot method for a data.frame
plot(iris)
Being a statistical language, R has a large number of common statistical techniques built-in ranging from simple to complex. For example, a simple summary:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Or grouped summaries and aggregates:
with(iris,
aggregate(iris[,1:4],
by = list(species=Species),
FUN = mean))
species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Or a simple regression:
fit.1 <- lm(Petal.Width ~ Species, data = iris)
summary(fit.1)
Call:
lm(formula = Petal.Width ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.626 -0.126 -0.026 0.154 0.474
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.24600 0.02894 8.50 1.96e-14 ***
Speciesversicolor 1.08000 0.04093 26.39 < 2e-16 ***
Speciesvirginica 1.78000 0.04093 43.49 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2047 on 147 degrees of freedom
Multiple R-squared: 0.9289, Adjusted R-squared: 0.9279
F-statistic: 960 on 2 and 147 DF, p-value: < 2.2e-16
plot(fit.1)
One of R’s traditional strengths is its plotting capabilities. To see a quick demo, run demo("graphics"). But here’s an example using the iris dataset:
boxplot(Petal.Width ~ Species, data = iris,
ylab = "petal width",
main = "petal width by species")
One of the more powerful features of R is the ability to extend its functionality by way of third-party packages. For example, dplyr provides alternatives for common tasks involving datasets. The same aggregate as above can be re-expressed as follows:
library(dplyr)
group_by(iris, Species) %>% summarise_all(mean)
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Similarly, ggplot2 provides an alternative way of plotting that is very popular. To recreate the boxplot presented above:
library(ggplot2)
ggplot(data=iris, aes(x=Species, y=Petal.Width)) +
geom_boxplot(aes(fill=Species)) +
ylab("petal width") + ggtitle("petal width by species")
Assignment is traditionally achieved by using an assignment operator such as <- or ->, or the assign command, though = also works for assignment. For example, to assign the value 1 to x, issue any of:
x <- 1
1 -> x
x = 1
assign("x", 1)
x
[1] 1
Which operator you use for assignment is largely a matter of preference, despite what you may have heard, but you should probably at least be consistent. Many believe using <- for assignment leads to more readable code, and it is also probably the most common practice amongst long-time R programmers. Moreover, a number of coding guidelines, such as those created by Google, insist on using <- for assignment. We’ll stick with this convention exclusively.
R is weakly typed. In the above examples, the type of x is determined by context–we never explicitly tell R what kind of value x will hold. Moreover, x can be overwritten with a different value as we see fit.
While weak typing can make programs a little easier to write, it has the potential to cause a number of errors. It is not uncommon, for example, to store what looks like a number as a character string, and using a string where a number is expected will generally not work in R.
In some situations, then, we might first check a variable has the expected type, perhaps terminating program execution if it does not (so-called ‘defensive’ programming).
Fortunately, R provides a number of ways we can check the type of an object. For example:
x <- 1
class(x)
[1] "numeric"
str(x)
num 1
str is short for structure, and is an extremely useful command for examining objects as they can become somewhat complex.
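As a sketch of the ‘defensive’ style mentioned above, a hypothetical function double_it might verify its argument before proceeding:

```r
# double_it is a hypothetical function used purely for illustration
double_it <- function(x) {
  # terminate with an informative error if x is not numeric
  if (!is.numeric(x)) stop("x must be numeric, got: ", class(x))
  x * 2
}
double_it(21)       # 42
# double_it("21")   # would stop with an error
```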
| operator | definition | example | result |
|---|---|---|---|
| `+` | addition | `1 + 1` | `2` |
| `-` | subtraction | `1 - 1` | `0` |
| `*` | multiplication | `2 * 3` | `6` |
| `/` | division | `2 / 3` | `0.6666667` |
| `^`, `**` | exponentiation | `2 ^ 3` | `8` |
| `%%` | modulo | `(1:6) %% 3` | `1, 2, 0, 1, 2, 0` |
| `%/%` | integer division | `(1:6) %/% 3` | `0, 0, 1, 1, 1, 2` |
| operator | definition | example | result |
|---|---|---|---|
| `==` | test equality | `1 == 1` | `TRUE` |
| `&` | and | `c(TRUE, TRUE) & c(FALSE, TRUE)` | `FALSE, TRUE` |
| `&&` | and | `c(TRUE, TRUE) && c(FALSE, TRUE)` | `FALSE` |
| `\|` | or | `c(FALSE, FALSE) \| c(FALSE, TRUE)` | `FALSE, TRUE` |
| `\|\|` | or | `c(FALSE, FALSE) \|\| c(FALSE, TRUE)` | `FALSE` |
| `<` | less than | `c(1,2) < 2` | `TRUE, FALSE` |
| `<=` | less than or equal | `c(1,2,3) <= 2` | `TRUE, TRUE, FALSE` |
| `>` | greater than | `c(2,3) > 2` | `FALSE, TRUE` |
| `>=` | greater than or equal | `c(1,2,3) >= 2` | `FALSE, TRUE, TRUE` |
To understand the difference between & and &&, and also between | and || we’ve snuck in some vectors, hopefully without the loss of any clarity (vectors are discussed in more detail later).
Like many operators and functions in R, the binary operators work in a pairwise fashion over vectors. For example, if x is c(m1, m2) and y is c(n1, n2), then x & y is c(m1 & n1, m2 & n2).
&& is similar to &, except that where vectors are involved, the comparison is only made for the first element of each. That is, x && y is m1 & n1.
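A small sketch of the difference (note that in recent versions of R, using && or || with an operand of length greater than one is an error rather than a silent first-element comparison, so the scalar form is used below):

```r
x <- c(TRUE, TRUE)
y <- c(FALSE, TRUE)
x & y          # pairwise: FALSE TRUE
x[1] && y[1]   # scalar 'and' on the first elements: FALSE
```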
Numeric values come in two broad types: double and integer. Objects that are type double have decimal expansions, while integers do not. R has the notion of a numeric class, and both double and integer types are considered numeric. R also has the notion of an integer class and this, of course, only applies to integers.
Unless specifically stated, a user defined numeric literal will have a type of double. To create an integer, we specify a number followed by L. For example, 1 is a double equal to one, while 1L is an integer equal to one.
It is often not a terribly interesting distinction, but one should be careful when doing arithmetic with doubles which ought to return integer-valued results as precision related issues can mean we obtain 1.999999999 rather than 2, say.
str(1)
num 1
str(1L)
int 1
.Internal(inspect(1))
@558180640fa8 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 1
.Internal(inspect(1L))
@55817e677718 13 INTSXP g0c1 [NAM(2)] (len=1, tl=0) 1
typeof(1)
[1] "double"
typeof(1L)
[1] "integer"
class(1)
[1] "numeric"
class(1L)
[1] "integer"
is.numeric(1)
[1] TRUE
is.numeric(1L)
[1] TRUE
is.integer(1)
[1] FALSE
is.integer(1L)
[1] TRUE
Strings in R have type character. String manipulation is a pretty large topic by itself, so we’ll just cover the basics here.
A string is created simply by enclosing a sequence of characters in single or double quotes, as in:
(x <- "Hello, World!")
[1] "Hello, World!"
Now consider:
length(x)
[1] 1
nchar(x)
[1] 13
As for the numeric types, a single string is really just a vector of strings which has a length of one. Thus, the length function when applied to a single string returns 1. If we want the length of a particular string, we use the nchar (number of characters) function.
This is very important–it is a very common mistake to use length instead of nchar when trying to get the length of a single string.
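The distinction is clearest with a vector of more than one string. length counts the strings; nchar measures each one:

```r
x <- c("apple", "kiwi")
length(x)   # 2: the number of strings in the vector
nchar(x)    # 5 4: the number of characters in each string
```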
Strings are concatenated with the paste function. The function takes a comma separated list of strings to concatenate, and an optional parameter sep. sep is a string which is placed between each concatenated string, and defaults to a single space. For example:
paste("a","b","c")
[1] "a b c"
paste("a","b","c", sep="*")
[1] "a*b*c"
paste("a","b","c", sep="")
[1] "abc"
paste("x_", 1:10, sep="")
[1] "x_1" "x_2" "x_3" "x_4" "x_5" "x_6" "x_7" "x_8" "x_9" "x_10"
paste0 is a useful convenience function which is the same as paste, but where sep is set to '' (an empty string). That is, paste0(x) is equivalent to paste(x, sep='').
paste0("a","b","c")
[1] "abc"
The paste function can also be used to collapse a vector of strings into a single string, and to attach a prefix, say, to a vector.
paste(1:10, collapse=", ")
[1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
paste(paste0("x", 1:10), collapse=" + ")
[1] "x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10"
paste(paste0("x", 1:10), collapse=", ")
[1] "x1, x2, x3, x4, x5, x6, x7, x8, x9, x10"

Exercises:

1. Store the string "Hello, World!" as a variable called s.
2. Split s programmatically into the two words Hello and World! (hint: ?strsplit).
3. Replace World with your own name (hint: ?gsub).
4. Replace World with your own name (hint: ?substr).
5. Find the word needle in the vector c("haystack needle", "needle") (hint: ?agrep).
6. Find the word needle in the vector c("haystack needle", "needle") (hint: ?grep).

R has a logical type which can take one of two values: TRUE or FALSE. These can be abbreviated to T and F, though the abbreviated form isn’t commonly used. In fact, TRUE and FALSE are protected, while T and F are simply global variables that have been initially assigned the values TRUE and FALSE, respectively. You can overwrite either T or F, so you should probably just avoid the abbreviations altogether.
We have already discussed logical operators, which return logical values, so we are already somewhat familiar with the logical type. Logical values are commonly used to extract subsets of rows and columns from array-like structures, and also to control program flow in conditionals and loops. We’ll see both uses in action a little later.
NULL

In R, NULL is the null object. If an object has a value of NULL, it essentially means the variable is not defined. It isn’t quite the same as saying a variable doesn’t exist, though–it just means it isn’t currently assigned a value.
x <- NULL
exists("x")
[1] TRUE
is.null(x)
[1] TRUE
NULL also serves a useful function as a deletion operator of sorts. We’re probably tackling things in slightly the wrong order here since we haven’t yet come across data frames or lists, but consider:
df <- data.frame(a=c(1,2), b=c(3,4), c=c(5,6))
print(df, row.names = FALSE)
a b c
1 3 5
2 4 6
df[,1] <- NULL
print(df, row.names = FALSE)
b c
3 5
4 6
In R, it is common to get NULL if one asks for a non-existent element in a vector, array, or list. For example:
l <- list(a=1, b=2)
# this exists:
l[["a"]]
[1] 1
# but this doesn't:
l[["z"]]
NULL
NA

NA is short for ‘not available’, and denotes a missing value. Missing values are very common, and having an explicit type makes them easy to deal with. Again, the following examples assume knowledge of things we haven’t seen yet, but consider:
(x <- c(1, NA, 2, NA, 3, NA))
[1] 1 NA 2 NA 3 NA
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 2
mean(x[!is.na(x)])
[1] 2
The most fundamental data type in R is a vector–even a scalar is represented internally as a vector of length 1. Vectors in R come in two flavours: so-called atomic and recursive structures.
An atomic vector is one where all elements are of the same type–one of logical, integer, numeric, complex, character, or raw. We’ll talk about the recursive structures later, though we will generally prefer the terms ‘list-like’ or ‘generic vector’. In the case of a generic vector, elements can be of differing types, and each element can be anything at all, including another generic vector.
A vector is commonly created using the function c, such as:
c(1,2,3,4,5)
[1] 1 2 3 4 5
In this case, c is for combine. The function combines all its arguments into an appropriate type of collection.
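Because an atomic vector must hold a single type, c will coerce mixed arguments to a common type. A quick sketch:

```r
# c() coerces mixed arguments to a single common type
c(1, 2, "three")    # all character: "1" "2" "three"
c(TRUE, FALSE, 2)   # logicals become numeric: 1 0 2
```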
In R there are many functions which return vectors as a result. An equivalent vector to the above can be created in several different ways, for example:
(1:5)
[1] 1 2 3 4 5
(seq(from = 1, to = 5, by = 1))
[1] 1 2 3 4 5
If x is a vector, then it can be subset by x[y] where y is itself a vector which contains the positions of the desired elements of x. The elements are returned in the order implied by y, and the first element has an index of 1 (not 0, for those accustomed to other programming languages). Interestingly, the same element can be requested more than once.
(x <- c(1, 3, 5, 7, 9))
[1] 1 3 5 7 9
x[c(5,4,3,2,1)]
[1] 9 7 5 3 1
x[c(2, 4)]
[1] 3 7
x[5]
[1] 9
x[c(1,1,2,2,3,3,4,4,5,5)]
[1] 1 1 3 3 5 5 7 7 9 9
At this point, there are a few special cases which are worth noting. First, if a vector has \(n\) elements, then indexes larger than \(n\) will return NA.
For that same vector, if \(1≤m≤n\), then an index of \(-m\) removes element \(m\), while if \(m>n\) then an index of \(-m\) does nothing.
Interestingly, if a non-integer number is used as an index, it is first truncated, and then used as an index. It’s hard to imagine a genuine situation where this might be useful, but we mention it for the sake of completeness.
Finally, an index of 0 returns nothing (well, a vector of length 0).
(x <- c(1, 2, 3, 4, 5))
[1] 1 2 3 4 5
x[6]
[1] NA
x[-1]
[1] 2 3 4 5
x[c(-1,-2)]
[1] 3 4 5
x[-6]
[1] 1 2 3 4 5
x[c(1.1, 1.9, 2.1)]
[1] 1 1 2
x[0]
numeric(0)
R mostly uses what is commonly referred to as a ‘copy on modify’ semantic. Consider:
(x <- 1:5)
[1] 1 2 3 4 5
x[-1]
[1] 2 3 4 5
x
[1] 1 2 3 4 5
Notice that x[-1] returns a copy of x with the first element removed, but x itself is left unchanged. If we want to change x, we need to overwrite it:
(x <- 1:5)
[1] 1 2 3 4 5
x <- x[-1]
x
[1] 2 3 4 5
A vector can also be subset using a vector of logical values. That is, if x is a vector we wish to subset, and y is a vector containing TRUE or FALSE in each position, then x[y] will consist of elements of x in the same positions as TRUE values of y. For example:
(x <- 1:6)
[1] 1 2 3 4 5 6
x[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)]
[1] 1 3 5
It is worth noting that when subsetting with a logical vector in this way, if y is shorter than x, then R will recycle y to match the length of x. This is handy when done deliberately, but can cause subtle bugs otherwise. So, be careful.
(x <- 1:6)
[1] 1 2 3 4 5 6
x[c(TRUE, FALSE)] # same as x[c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)]
[1] 1 3 5
x[c(TRUE, FALSE, FALSE)] # same as x[c(TRUE,FALSE,FALSE,TRUE,FALSE,FALSE)]
[1] 1 4
Probably the most useful way to subset a vector is via a logical expression. For example, if e is an expression involving a vector, x, then x[e] will return all the elements of x where e evaluates to TRUE. For example:
x <- 1:10
x <= 5
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
x[x <= 5]
[1] 1 2 3 4 5
There is also a subset function in R which can be used to subset array-like structures in R. It doesn’t really make things any simpler than using logical vectors in the way already described, but users might prefer it nevertheless.
x <- 1:10
subset(x, subset=x<=5)
[1] 1 2 3 4 5
Again, remember that none of these methods modify x, but instead return a copy of x that has been modified.
Adding to an existing vector is relatively straightforward. Probably the most common methods are by using the c function we’ve already seen, or the append function.
x <- 1:4
c(x, 5)
[1] 1 2 3 4 5
append(x, 5)
[1] 1 2 3 4 5
Modifying a vector is also relatively easy–all we do is assign values to subsets. For example, x[1] will return the first element of x, while x[1] <- value will change the value of the first element. For example:
x <- 1:5
x[1] <- 0
x
[1] 0 2 3 4 5
But this works equally well for several values. For example:
x <- 1:5
x[c(1,2)] <- c(5,4)
x
[1] 5 4 3 4 5
Now let us consider the use of the usual binary and logical operators as they apply to vectors. We’ve already seen logical operators in action a little, but we’ll now consider them in slightly more detail.
So, consider the expression x+y. There are two possible scenarios:
1. x and y are vectors of equal length
2. x and y are vectors of different lengths

If equal in length, all operations are applied in a pair-wise fashion. If unequal, the shorter vector will be repeated until it matches the length of the longer vector (again, we call this recycling), and then the operator will be applied in a pairwise fashion.
(Note that a scalar is just a vector of length 1, and a vector of length 1 can be recycled to match the length of any vector of positive length.)
x <- 1:4
x + 4:1
[1] 5 5 5 5
x + 1 # the same as x + c(1,1,1,1)
[1] 2 3 4 5
x + 1:2 # the same as x + c(1,2,1,2)
[1] 2 4 4 6
x + 1:3 # the same as x + c(1,2,3,1)
Warning in x + 1:3: longer object length is not a multiple of shorter
object length
[1] 2 4 6 5
Applying functions is another common way of manipulating vectors. There are a huge number of built-in functions which take 1 or more vectors as arguments (and, remember, the binary operators are just functions) and return… well, anything really. And, of course, you can write your own functions, which we’ll discuss a little later.
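A few representative sketches: some functions apply element-wise, while others reduce a vector to a summary.

```r
x <- c(4, 9, 16)
sqrt(x)     # applied element-wise: 2 3 4
sum(x)      # reduces the vector to a single value: 29
cumsum(x)   # running totals, same length as x: 4 13 29
range(x)    # smallest and largest values: 4 16
```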
Exercises:

1. Create a numeric vector called x.
2. Display the first few elements of x (hint: ?head).
3. Display the last few elements of x (hint: ?tail).
4. Display the elements of x in reverse order (hint: ?rev).
5. Create a vector of uniform random numbers called y (hint: ?runif).
6. Find the ordering of the elements of y (hint: ?order).
7. Sort y (hint: ?sort).
8. Compute the mean of y from first principles (hint: ?sum, ?length).
9. Compute the mean of y (hint: ?mean).
10. Compute the standard deviation of y (hint: ?sd).
11. Create a vector, z, containing 100 numbers distributed \(N(0,1)\) (hint: ?rnorm).
12. Tabulate z into bins (hint: ?cut, ?table).
13. Plot a histogram of z (hint: ?hist).

A matrix is similar to a vector except that it has two dimensions, not one. In fact, a vector can often be used in place of a \(1\times{}n\) matrix without ill effect–a \(1\times{}n\) matrix is often referred to as a row vector after all.
A matrix is created with the matrix function. For example:
(matrix(1:6, nrow=2, ncol=3, byrow=TRUE))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Specifying byrow=TRUE means the matrix is created row-by-row, left-to-right. Specifying byrow=FALSE means the matrix is created column-by-column, top-to-bottom. The default is FALSE, so omitting the argument is equivalent to supplying byrow=FALSE.
(matrix(1:4, nrow=2, byrow=TRUE))
[,1] [,2]
[1,] 1 2
[2,] 3 4
(matrix(1:4, nrow=2, byrow=FALSE))
[,1] [,2]
[1,] 1 3
[2,] 2 4
When creating a matrix, you need only specify either of nrow or ncol–one implies the other, given an input vector. If neither is specified, the result is a column vector.
matrix(1:6, nrow=2)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
matrix(1:6, ncol=2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
matrix(1:6)
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
t(matrix(1:6))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
The elements of a matrix are accessed similarly to vectors, except that one must specify conditions for two sets of indices.
If A is a matrix, then the element in row i and column j is extracted via A[i,j].
If either index is left empty then an entire row or column is extracted.
For example, A[i,] extracts row i, while A[,j] extracts column j. Note, however, that in both cases the result is a vector, not a matrix.
Just as for vectors, expressions can also be used in place of i and j. For example, odd rows and even columns could be extracted with A[c(TRUE,FALSE),c(FALSE,TRUE)].
(A <- matrix(1:9, nrow=3))
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
A[2,] # extract second row
[1] 2 5 8
A[,2] # extract second column
[1] 4 5 6
A[2,3] # extract element in second row, third column
[1] 8
A[c(1,3), c(1,3)] # extract rows 1 and 3, and columns 1 and 3
[,1] [,2]
[1,] 1 7
[2,] 3 9
A[c(FALSE,TRUE)] # select every second element by column
[1] 2 4 6 8
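The odd-rows/even-columns example mentioned above can be sketched directly. Note the logical vectors are recycled to the matrix dimensions, and since only one column survives, the result drops to a vector:

```r
(A <- matrix(1:9, nrow = 3))
# odd rows and even columns via recycled logical vectors;
# only column 2 is selected, so the result is the vector 4 6
A[c(TRUE, FALSE), c(FALSE, TRUE)]
```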
The binary operators described for vectors generally work on matrices, and in a similar fashion. That is, if the two operands are matrices of equal order, then operators work on element pairs in the same positions.
We can also mix matrices and other types, such as vectors, and arguments will be recycled as necessary where dimensions are not compatible. For example, if a matrix with \(m\) rows is multiplied by a vector of length \(m\), then the vector is used to multiply each column of the matrix.
(A <- matrix(1:6, nrow=3))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
A + A
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
A * A
[,1] [,2]
[1,] 1 16
[2,] 4 25
[3,] 9 36
A * c(2,3,4)
[,1] [,2]
[1,] 2 8
[2,] 6 15
[3,] 12 24
If one wants genuine matrix multiplication, as opposed to the pairwise multiplication above, then one can use the %*% operator. To calculate an outer product, one can use the %o% operator.
(A <- matrix(1:4, nrow=2))
[,1] [,2]
[1,] 1 3
[2,] 2 4
(B <- matrix(1:4, ncol=2))
[,1] [,2]
[1,] 1 3
[2,] 2 4
A%*%B
[,1] [,2]
[1,] 7 15
[2,] 10 22
1:3%o%1:3
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 6
[3,] 3 6 9
The dimensions of a matrix can be discovered with the dim function, while the number of rows and columns can be found with the nrow and ncol functions, respectively.
The length function, which we’ve already seen, returns the total number of elements in a matrix (internally, a matrix is stored as a 1-dimensional vector with a dimension attribute).
(A <- matrix(1:6, nrow=3))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
dim(A)
[1] 3 2
nrow(A); ncol(A); length(A)
[1] 3
[1] 2
[1] 6
Names can be assigned to the rows and columns of a matrix. This might not seem particularly useful now, but at the very least it can come in handy for purely aesthetic reasons.
(A <- matrix(1:6, nrow=3))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
colnames(A) <- c("c1", "c2")
rownames(A) <- c("r1", "r2", "r3")
A
c1 c2
r1 1 4
r2 2 5
r3 3 6
colnames and rownames are interesting functions since they can be used both to ‘get’ and to ‘set’ column and row names.
For example, colnames(A) <- c("c1", "c2") assigns column names to A, while colnames(A) retrieves the column names.
There are plenty of other examples of this in R–square brackets, for example:
(A <- matrix(1:4, nrow=2))
[,1] [,2]
[1,] 1 3
[2,] 2 4
A[2,] <- c(5,7)
A
[,1] [,2]
[1,] 1 3
[2,] 5 7
Finally, simple aggregates can be derived with the colSums, rowSums, colMeans, and rowMeans functions. The use of these functions is somewhat self explanatory.
(A <- matrix(1:6, nrow=3))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
colSums(A)
[1] 6 15
rowSums(A)
[1] 5 7 9
colMeans(A)
[1] 2 5
rowMeans(A)
[1] 2.5 3.5 4.5
| function | definition | example |
|---|---|---|
| `t` | transpose | `A <- matrix(1:6, nrow=3); t(A)` |
| `cbind` | column bind | `A <- matrix(1:4, nrow=2); cbind(A, A)` |
| `rbind` | row bind | `A <- matrix(1:4, nrow=2); rbind(A, A)` |
| `solve` | solve a system of linear equations; returns the inverse of the lhs if the rhs is omitted | `A <- matrix(1:4, nrow=2); A %*% solve(A)` |
| `diag` | extract the diagonal of a matrix as a vector, or create a diagonal matrix from a vector | `diag(3); diag(1:3)` |
| `det` | determinant | `det(matrix(1:4, nrow=2))` |
| `eigen` | eigenvalues and eigenvectors | `A <- matrix(c(3,1,1,3), nrow=2); e <- eigen(A); v <- e$vectors; v %*% diag(e$values) %*% solve(v)` |
| `svd` | singular value decomposition | `A <- matrix(c(3,1,1,3), nrow=2); s <- svd(A); s$u %*% diag(s$d) %*% t(s$v)` |
| `chol` | Choleski decomposition | `A <- matrix(c(3,1,1,3), nrow=2); t(chol(A)) %*% chol(A)` |
| `qr` | QR decomposition | `A <- matrix(c(3,1,1,3), nrow=2); decomp <- qr(A); qr.Q(decomp) %*% qr.R(decomp)` |
| `princomp` | principal components | `A <- matrix(c(3,1,1,3), nrow=2); princomp(A)` |
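The table shows solve used to invert a matrix, but solve can also solve a system directly when a right-hand side is supplied. A small sketch (the system here is arbitrary):

```r
# solve A x = b for x
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(4, 7)
(x <- solve(A, b))  # the solution vector: 1 2
A %*% x             # recovers b
```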
Use the `solve` function to solve the following system of equations:
\[
\begin{aligned}
-2x + y &= 1\\
x + y &= 4
\end{aligned}
\]
Then plot the two lines and mark the solution (hint: `?curve`, `?points`).

An array is like a matrix, except that it can have more than 2 dimensions. In fact, an array created with just two dimensions is a matrix.
(A <- array(1:6, dim=c(3,2)))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
class(A)
[1] "matrix"
is.array(A)
[1] TRUE
An array is created with the array function. The first argument should be a vector containing the elements of the array, and the next, dim, is a vector holding the array dimensions.
Arrays of several dimensions can be hard to interpret, so we’ll just provide one more example of an array with three dimensions. In this case, if dim is c(m,n,o) we can think of the resulting array as being made up of o m by n matrices.
(A <- array(1:12, dim=c(3,2,2))) #2 3x2 matrices
, , 1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
, , 2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12
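Elements of an array are extracted with one index per dimension, much as for matrices. For example:

```r
A <- array(1:12, dim = c(3, 2, 2))
A[2, 1, 2]   # row 2, column 1 of the second matrix: 8
A[, , 1]     # the entire first 3x2 matrix
A[1, , ]     # first row across both matrices (a 2x2 matrix)
```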
A list, sometimes referred to as a generic vector, is similar to a vector, except that each item in a list can be any kind of object. For example, a list could have a scalar as the first element, a vector as the second, a data frame as the third, a list as the fourth, and so on.
It might not be obvious why a list is useful. Indeed, analysts who are primarily interested in examining data sets may never have the need for a list at all.
Nevertheless, many programmers will find lists useful. For example, a list can be used to return multiple objects from a function (functions must return a single object, but you can certainly put two or more objects in a single list and return the list). Or it can provide functionality similar to a dictionary or associative array (but keep in mind that searching by key using lists is \(O(n)\), so can be slower than expected).
The usual way to create a list is by passing a comma separated list of name-value pairs to the list function:
( lst <- list(a=c(1,2), b=c("a", "b"), c=list(a=1, b=2)) )
$a
[1] 1 2
$b
[1] "a" "b"
$c
$c$a
[1] 1
$c$b
[1] 2
str(lst)
List of 3
$ a: num [1:2] 1 2
$ b: chr [1:2] "a" "b"
$ c:List of 2
..$ a: num 1
..$ b: num 2
names(lst)
[1] "a" "b" "c"
Accessing list elements is similar to accessing vector elements, except you generally use [[ rather than [. For example:
lst[[1]]
[1] 1 2
lst[["a"]]
[1] 1 2
It is possible to extract elements with [, but this generally returns a list of length 1:
str(lst["a"])
List of 1
$ a: num [1:2] 1 2
To drive this home:
lst["a"]
$a
[1] 1 2
lst["a"]["a"]["a"]["a"]
$a
[1] 1 2
We have singled out data frames as probably the most fundamental structure for the typical statistical analyst, giving them their own section and devoting much more space to them than to the other types.
A data frame is similar to a matrix or 2-dimensional array. But whereas all elements of an array must be of the same type, a data frame can contain a mixture of column types.
So, a data frame is R’s version of what an analyst would refer to as a data set. We will generally continue to use the term data frame, but the distinction is essentially one of semantics.
Columns of a data frame can (and should) be named, and in this case we can think of them as fields or variables, and refer to them by name rather than by column position, say, as we did for arrays.
RStudio has a File menu item, Import Dataset, which can be used to import datasets interactively.

This can all be done programmatically, and in the background there are three approaches used:

- the `readr` package for delimited text files
- the `readxl` package for Excel files
- the `haven` package for SAS, SPSS, and Stata files

The UI approach works well, but there are good reasons to import datasets programmatically. And, usefully, the import dialog also shows the R code used for the import, so one could import data via the UI once, then borrow that code for future, similar imports.
Note that users can also persist data in R’s native format, and facilities exist to read and write such data to disk. It should be somewhat obvious that storing data in R’s native format is likely to be useful, even when the data originated in a different format.
We cover the basics by providing the same dataset in three formats: sas7bdat, csv, and xlsx. These are stored as data/hhsurvey.sas7bdat, data/hhsurvey.csv, and data/hhsurvey.xlsx, respectively. The original SAS dataset is from the book ‘Principles of Econometrics’, and was downloaded from here.
hhsurvey.sas <- haven::read_sas("data/hhsurvey.sas7bdat")
head(hhsurvey.sas)
# A tibble: 6 × 6
A ALCOH FOOD K TRPORT X
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 8.99 157.05 0 80.51 692
2 2 17.75 70.78 0 40.72 272
3 2 2.97 177.20 0 29.31 1130
4 2 13.50 75.11 2 38.11 535
5 2 47.41 147.89 0 108.27 767
6 2 49.73 116.94 0 503.75 682
hhsurvey.xlsx <- readxl::read_excel("data/hhsurvey.xlsx")
head(hhsurvey.xlsx)
# A tibble: 6 × 6
A ALCOH FOOD K TRPORT X
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 8.99 157.05 0 80.51 692
2 2 17.75 70.78 0 40.72 272
3 2 2.97 177.20 0 29.31 1130
4 2 13.50 75.11 2 38.11 535
5 2 47.41 147.89 0 108.27 767
6 2 49.73 116.94 0 503.75 682
hhsurvey.csv <- readr::read_csv("data/hhsurvey.csv")
head(hhsurvey.csv)
# A tibble: 6 × 6
A ALCOH FOOD K TRPORT X
<int> <dbl> <dbl> <int> <dbl> <int>
1 3 8.99 157.05 0 80.51 692
2 2 17.75 70.78 0 40.72 272
3 2 2.97 177.20 0 29.31 1130
4 2 13.50 75.11 2 38.11 535
5 2 47.41 147.89 0 108.27 767
6 2 49.73 116.94 0 503.75 682
We can compress and save any of the datasets in R’s native format as follows:
saveRDS(hhsurvey.sas, "data/hhsurvey.rds")
Reading it back in is as simple as:
hhsurvey.rds <- readRDS("data/hhsurvey.rds")
Again, saving data in R’s native format to be read in later is very convenient. The files are efficiently compressed, so take up as little space as possible, and they can be opened relatively quickly since, once uncompressed, they are already in an R-readable format with no conversion required. And it works for any R object, not just data frames. Of course, if the intention is to have a dataset that can be opened in a range of tools, then something universal like CSV is probably best.
A data frame can be created with the data.frame function. The first formal argument of data.frame is ..., often referred to as ellipsis or dot-dot-dot. This is a special object in R, and is discussed in more detail later when learning more about functions. For now, it will simply be taken to be a placeholder for a set of parameters of arbitrary length.
The most basic method of creating a data frame is to pass a vector, or an array, for each column we wish to appear in the result. Consider the following:
(df <- data.frame(a = c("a", "b", "c", "d"),
b = 1:4,
c = matrix(1:8, nrow=4)))
a b c.1 c.2
1 a 1 1 5
2 b 2 2 6
3 c 3 3 7
4 d 4 4 8
There are 3 named arguments here: a, a character vector; b, a numeric vector; and c, a matrix consisting of two columns. The data frame takes its column names from the argument names, in this case a, b and c. However, c has two columns, so R used c.1 and c.2 in order to ensure uniqueness.
Pretty much any array-like object can be used in this way, including another data frame. The only real requirement is that the implied number of rows is the same for each argument.
Column names can be used to access the columns of a data frame directly, and so provide an invaluable convenience. It is in this context that the interpretation of a column as a variable makes sense. For example, we can access the second column of the above data frame in the following ways:
df[,2]
[1] 1 2 3 4
df[,"b"]
[1] 1 2 3 4
df$b
[1] 1 2 3 4
The column names of a data frame are held in an attribute called names. They can be extracted in a number of ways, but for now we’ll limit discussion to either of the colnames or names functions, which are interchangeable.
These functions can be used to both get and set variable names. For example, let’s extract the names, but then also rename c.1 and c.2 as c and d, respectively. For good measure, let’s also then change the names to upper case…
colnames(df)
[1] "a" "b" "c.1" "c.2"
names(df)
[1] "a" "b" "c.1" "c.2"
colnames(df)[3:4] <- c("c", "d") # set names of columns 3 and 4
names(df) <- toupper(names(df)) # toupper converts to upper case
df
A B C D
1 a 1 1 5
2 b 2 2 6
3 c 3 3 7
4 d 4 4 8
Note that the rows of a data frame are themselves named. Row names can be also be extracted in a variety of ways, but we’ll only consider the rownames function for now.
By default, row names are simply a character representation of the row numbers. For example, row 1 will have the name "1". But, of course, this won’t necessarily be the case once we start aggregating and subsetting.
Finally, there are a number of ways of learning about a data frame programmatically. We’ve already seen the str function earlier, and this can be very useful if we want to quickly learn about the structure of a data frame.
And just as the length function might not have behaved as expected for strings, its behaviour for data frames might not seem sensible at first.
For data frames, the length function returns the number of columns. This is because the internal implementation of a data frame is as a list of vectors, and the length function is returning the number of items in the list (we’ll learn more about lists soon).
The number of rows is obtained via the nrow function, and the number of columns is obtained with the ncol function.
In addition, we can test the class and type of R objects with class and typeof, or test if an object is a data frame with is.data.frame.
df <- data.frame(a=c("a", "b", "c"), b=1:3)
str(df)
'data.frame': 3 obs. of 2 variables:
$ a: Factor w/ 3 levels "a","b","c": 1 2 3
$ b: int 1 2 3
nrow(df)
[1] 3
ncol(df)
[1] 2
length(df)
[1] 2
dim(df)
[1] 3 2
dimnames(df)
[[1]]
[1] "1" "2" "3"
[[2]]
[1] "a" "b"
colnames(df)
[1] "a" "b"
names(df)
[1] "a" "b"
rownames(df)
[1] "1" "2" "3"
class(df)
[1] "data.frame"
typeof(df)
[1] "list"
is.data.frame(df)
[1] TRUE
Why is all this important? Well, when you come to writing your own programs and functions you will often need conditional logic that does different things depending on the input.
Does a data frame contain a particular field, and is that field the right data type? If not, should program execution terminate, or is there some fallback option?
And so on…
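As a sketch of the kind of conditional logic being described (the column name, function name, and messages here are hypothetical):

```r
# fail early if the input is not the shape of data we expect
check_input <- function(df) {
  if (!is.data.frame(df))   stop("expected a data frame")
  if (!"a" %in% names(df))  stop("expected a column named 'a'")
  if (!is.numeric(df$a))    stop("expected column 'a' to be numeric")
  invisible(TRUE)
}

check_input(data.frame(a = 1:3))      # passes silently
# check_input(data.frame(b = 1:3))    # would stop with an error
```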
Subsetting a data frame is basically the same as for a matrix. The general syntax is df[a,b], where df is a data frame, a controls which rows are extracted, and b controls which columns are extracted. a and b can be any of the following:

- a vector of integer indices
- a logical (TRUE or FALSE) vector, where only the rows or columns corresponding to TRUE are returned
- a vector of row or column names

Alternatively, as already seen, we could extract a field called a from data frame df with df$a or df[,"a"].
We’ll demonstrate each of these possibilities with examples, and we’ll use a data frame as follows:
set.seed(1) # choose seed to ensure repeatable results
(df <- data.frame(a=rnorm(10,0,1), b=rnorm(10,0,1), c=rnorm(10,0,1)))
a b c
1 -0.6264538 1.51178117 0.91897737
2 0.1836433 0.38984324 0.78213630
3 -0.8356286 -0.62124058 0.07456498
4 1.5952808 -2.21469989 -1.98935170
5 0.3295078 1.12493092 0.61982575
6 -0.8204684 -0.04493361 -0.05612874
7 0.4874291 -0.01619026 -0.15579551
8 0.7383247 0.94383621 -1.47075238
9 0.5757814 0.82122120 -0.47815006
10 -0.3053884 0.59390132 0.41794156
So, for example, to extract the first 5 rows and the first and last columns, we could use indices as appropriate:
df[1:5, c(1,3)]
a c
1 -0.6264538 0.91897737
2 0.1836433 0.78213630
3 -0.8356286 0.07456498
4 1.5952808 -1.98935170
5 0.3295078 0.61982575
Or, we could use boolean vectors:
df[c(rep(TRUE, 5), rep(FALSE, 5)), c(T,F,T)]
a c
1 -0.6264538 0.91897737
2 0.1836433 0.78213630
3 -0.8356286 0.07456498
4 1.5952808 -1.98935170
5 0.3295078 0.61982575
Or, we could use row and column names:
df[c("1","2","3","4","5"), c("a","c")]
a c
1 -0.6264538 0.91897737
2 0.1836433 0.78213630
3 -0.8356286 0.07456498
4 1.5952808 -1.98935170
5 0.3295078 0.61982575
In order to demonstrate a conditional subset, let us now extract rows where all variables have positive values:
df[df$a>0 & df$b>0 & df$c>0,]
a b c
2 0.1836433 0.3898432 0.7821363
5 0.3295078 1.1249309 0.6198257
df[df[,"a"]>0 & df[,"b"]>0 & df[,"c"]>0,]
a b c
2 0.1836433 0.3898432 0.7821363
5 0.3295078 1.1249309 0.6198257
df[df[,1]>0 & df[,2]>0 & df[,3]>0,]
a b c
2 0.1836433 0.3898432 0.7821363
5 0.3295078 1.1249309 0.6198257
For relatively complicated expressions, these methods can become somewhat cumbersome. The subset function provides a useful convenience, and allows us to refer to columns as though they were variables in their own right. For example:
subset(df, a>0&b>0&c>0)
a b c
2 0.1836433 0.3898432 0.7821363
5 0.3295078 1.1249309 0.6198257
In this example, a>0 is equivalent to df$a>0, and a is understood to belong to df based on context.
A similar approach uses the with function:
with(df, df[a>0 & b>0 & c>0,])
a b c
2 0.1836433 0.3898432 0.7821363
5 0.3295078 1.1249309 0.6198257
In this example, an environment is constructed from df, and its columns can be referred to directly without reference to df within the context of the calling function. (Environments are a relatively advanced topic, so don’t worry about the details–see ?environment for further details if interested, though.)
- Datasets that ship with R can be loaded with the `data` function. What happens if you call `data()` without any arguments, though?
- Load the `iris` dataset. List the number of rows, number of columns, and the column names.
- Count the observations for each unique species (hint: `?unique`, `?table`).
- Call `plot` on the `anscombe` dataset (hint: `?plot`).
- Using the `anscombe` data, fit the following linear models: `y1~x1`, `y2~x2`, `y3~x3`, and `y4~x4` (hint: `?lm`).

Creating a new variable is relatively straightforward. We simply use the variables of a data frame in an expression as we would any other variable. We can assign values to data frame variables using the same notation we used to extract them; and if a variable doesn’t yet exist it will be created. Let us use the same data frame as in the previous section, and let us add a new variable, d, which is the average of a, b, and c…
set.seed(1)
(df <- data.frame(a=rnorm(10,0,1), b=rnorm(10,0,1), c=rnorm(10,0,1)))
a b c
1 -0.6264538 1.51178117 0.91897737
2 0.1836433 0.38984324 0.78213630
3 -0.8356286 -0.62124058 0.07456498
4 1.5952808 -2.21469989 -1.98935170
5 0.3295078 1.12493092 0.61982575
6 -0.8204684 -0.04493361 -0.05612874
7 0.4874291 -0.01619026 -0.15579551
8 0.7383247 0.94383621 -1.47075238
9 0.5757814 0.82122120 -0.47815006
10 -0.3053884 0.59390132 0.41794156
df$d <- (df$a + df$b + df$c)/3
df
a b c d
1 -0.6264538 1.51178117 0.91897737 0.60143491
2 0.1836433 0.38984324 0.78213630 0.45187429
3 -0.8356286 -0.62124058 0.07456498 -0.46076807
4 1.5952808 -2.21469989 -1.98935170 -0.86959026
5 0.3295078 1.12493092 0.61982575 0.69142148
6 -0.8204684 -0.04493361 -0.05612874 -0.30717691
7 0.4874291 -0.01619026 -0.15579551 0.10514776
8 0.7383247 0.94383621 -1.47075238 0.07046951
9 0.5757814 0.82122120 -0.47815006 0.30628416
10 -0.3053884 0.59390132 0.41794156 0.23548483
This approach works well enough for simple derivations, such as those that consist only of binary operators. More complicated derivations require a different approach. For example, let us now set d to be the minimum of a, b, and c.
We consider several approaches, all of which involve a bit of a leap. That is, they rely on techniques we have not yet discussed, so don’t worry too much if they don’t make complete sense right now.
One approach, familiar to SAS users, is to process a data set iteratively. To do this, we use a for loop to emulate a row-by-row operation.
for (i in 1:nrow(df)){
df[i, "d"] = min(df[i, c("a","b","c")])
}
df
a b c d
1 -0.6264538 1.51178117 0.91897737 -0.6264538
2 0.1836433 0.38984324 0.78213630 0.1836433
3 -0.8356286 -0.62124058 0.07456498 -0.8356286
4 1.5952808 -2.21469989 -1.98935170 -2.2146999
5 0.3295078 1.12493092 0.61982575 0.3295078
6 -0.8204684 -0.04493361 -0.05612874 -0.8204684
7 0.4874291 -0.01619026 -0.15579551 -0.1557955
8 0.7383247 0.94383621 -1.47075238 -1.4707524
9 0.5757814 0.82122120 -0.47815006 -0.4781501
10 -0.3053884 0.59390132 0.41794156 -0.3053884
Loops in R are relatively slow, and if performance is an issue they should be avoided. A vectorised solution is one that avoids any explicit looping, and should generally be the goal. The solution here is the apply function. The syntax is:
apply(X, MARGIN, FUN, ...)
X in this case is a data frame; MARGIN is 1 for rows, 2 for columns, and c(1,2) for rows and columns; and FUN is a function. In this case, we want to apply the minimum function, min, over the rows of df…
df$d <- apply(df[, c("a", "b", "c")], 1, min)
df
a b c d
1 -0.6264538 1.51178117 0.91897737 -0.6264538
2 0.1836433 0.38984324 0.78213630 0.1836433
3 -0.8356286 -0.62124058 0.07456498 -0.8356286
4 1.5952808 -2.21469989 -1.98935170 -2.2146999
5 0.3295078 1.12493092 0.61982575 0.3295078
6 -0.8204684 -0.04493361 -0.05612874 -0.8204684
7 0.4874291 -0.01619026 -0.15579551 -0.1557955
8 0.7383247 0.94383621 -1.47075238 -1.4707524
9 0.5757814 0.82122120 -0.47815006 -0.4781501
10 -0.3053884 0.59390132 0.41794156 -0.3053884
There is a whole family of so-called apply functions. We won’t discuss them in any real detail, but interested readers are encouraged to investigate them for themselves.
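For instance, sapply applies a function to each element of a vector or list and simplifies the result where possible, while lapply always returns a list:

```r
sapply(1:3, function(x) x^2)   # a numeric vector: 1 4 9
lapply(1:3, function(x) x^2)   # the same values, but as a list
sapply(mtcars[, 1:3], mean)    # means of the first 3 columns of mtcars
```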
To finish up, let us revisit the suggestion that loops suffer from poor performance compared to the vectorised equivalent. Let us create a data frame similar to df, but with 10000 rows rather than 10. Then, we again derive the minimum of a, b, and c using the two methods outlined above, timing each.
df <- data.frame(a=rnorm(10000,0,1), b=rnorm(10000,0,1), c=rnorm(10000,0,1))
(a <- system.time( for (i in 1:nrow(df)) df[i, "d"] <- min(df[i, c("a","b","c")])))
user system elapsed
3.492 0.016 3.541
(b <- system.time( df$d <- apply(df, 1, min) ))
user system elapsed
0.020 0.000 0.019
So, vectorising in this case proves much faster than the equivalent loop! For large datasets this improvement is far from trivial.
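For this particular derivation there is an even faster option than apply: pmin, the ‘parallel’ minimum, is fully vectorised and compares its arguments elementwise, so no row-by-row work is needed at all.

```r
# pmin takes the elementwise minimum across its arguments
df <- data.frame(a = rnorm(10000), b = rnorm(10000), c = rnorm(10000))
system.time(df$d <- pmin(df$a, df$b, df$c))
```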
There are three main methods of combining data which we will discuss here:

- combining by column, with `cbind`
- combining by row, with `rbind`
- merging, or joining, with `merge`
Combining data by columns is straightforward, and can be done when constructing a data frame, or else two data frames can be combined using the cbind function. In order to be combined by column, data frames must have the same number of rows. For example:
df1 <- data.frame(a=1:5, b=6:10)
df2 <- data.frame(c=11:15, d=16:20)
cbind(df1, df2)
a b c d
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Combining data frames by row is also straightforward and is conveniently achieved using the rbind function. When combining by row, data frames must have the same number of columns, and the column names must match.
df1 <- data.frame(a=c(1,4), b=c(2,5), c=c(3,6))
df2 <- data.frame(a=c(7,10), b=c(8,11), c=c(9,12))
rbind(df1, df2, make.row.names = FALSE)
a b c
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
Merging data in R is conveniently achieved using the merge function. We’ll demonstrate with the following simple data frames…
set.seed(1)
(df1 <- data.frame(a=c("a", "a", "b", "b", "c", "c", "d", "d"),
b=rep(c(1,2),4), c=rpois(8,10)))
a b c
1 a 1 8
2 a 2 10
3 b 1 7
4 b 2 11
5 c 1 14
6 c 2 12
7 d 1 11
8 d 2 9
(df2 <- data.frame(a=c("a", "a", "b", "b", "c", "c", "d", "d"),
b=rep(c(2,3),4), d=rpois(8,10)))
a b d
1 a 2 14
2 a 3 11
3 b 2 8
4 b 3 2
5 c 2 8
6 c 3 10
7 d 2 7
8 d 3 12
In its simplest form, one can simply call merge with only the two participating data frames as arguments:
merge(df1, df2)
a b c d
1 a 2 10 14
2 b 2 11 8
3 c 2 12 8
4 d 2 9 7
Borrowing database terminology, the default is what we would call an inner join. If not specified, the variables participating in the join are determined automatically as being all variables that appear in both data frames. In this case, a and b appear in both datasets, so these are the variables used in the join. In fact, the previous merge is equivalent to the following:
merge(df1, df2, by=c("a","b"), all.x=FALSE, all.y=FALSE)
Note that it is not a requirement that either data frame be sorted. By default, though, R will sort the result by the joining variables.
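To suppress that sorting, the sort argument can be set to FALSE (the data frames in this sketch are made up for the purpose; with sort=FALSE the row order of the result is not guaranteed by R, it is simply no longer sorted on the by columns):

```r
df1 <- data.frame(a = c("b", "a"), x = 1:2)
df2 <- data.frame(a = c("a", "b"), y = 3:4)
merge(df1, df2, sort = FALSE)  # result not sorted on the 'a' column
```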
A left outer join is obtained when we retain all rows of the data frame on the left hand side of the join (think of the merge as a Venn diagram with df1 on the left, and df2 on the right) whether or not there are matching rows in the data frame on the right. To achieve this, we set all.x=TRUE. Similarly, to perform a right outer join we would set all.y=TRUE. Keeping all rows of either data frame results in what is known as a full outer join, and it is achieved by setting both all.x=TRUE and all.y=TRUE.
merge(df1, df2, by=c("a","b"), all.x=TRUE, all.y=FALSE) # left join
a b c d
1 a 1 8 NA
2 a 2 10 14
3 b 1 7 NA
4 b 2 11 8
5 c 1 14 NA
6 c 2 12 8
7 d 1 11 NA
8 d 2 9 7
merge(df1, df2, by=c("a","b"), all.x=FALSE, all.y=TRUE) # right join
a b c d
1 a 2 10 14
2 a 3 NA 11
3 b 2 11 8
4 b 3 NA 2
5 c 2 12 8
6 c 3 NA 10
7 d 2 9 7
8 d 3 NA 12
merge(df1, df2, by=c("a","b"), all.x=TRUE, all.y=TRUE) # full join
a b c d
1 a 1 8 NA
2 a 2 10 14
3 a 3 NA 11
4 b 1 7 NA
5 b 2 11 8
6 b 3 NA 2
7 c 1 14 NA
8 c 2 12 8
9 c 3 NA 10
10 d 1 11 NA
11 d 2 9 7
12 d 3 NA 12
In this section, we use data from the survey package for illustration. Specifically, the package provides data on the academic performance of Californian schools. To make the data available:
data(api, package="survey")
As a result of this call, a number of data frames will be added to the user’s workspace. We’ll focus on just two: apipop and apisrs.
apipop is a data frame containing information about all Californian schools with at least 100 students, while apisrs is a simple random sample of 200 schools.
We use the population data frame to illustrate simple frequencies, and the sample data frame to illustrate the construction of weighted estimates.
head(apipop)
cds stype name sname snum
1 01611190130229 H Alameda High Alameda High 1
2 01611190132878 H Encinal High Encinal High 2
3 01611196000004 M Chipman Middle Chipman Middle 3
4 01611196090005 E Lum (Donald D.) Lum (Donald D.) Elementary 4
5 01611196090013 E Edison Elementa Edison Elementary 5
6 01611196090021 E Otis (Frank) El Otis (Frank) Elementary 6
dname dnum cname cnum flag pcttest api00 api99 target
1 Alameda City Unified 6 Alameda 1 NA 96 731 693 5
2 Alameda City Unified 6 Alameda 1 NA 99 622 589 11
3 Alameda City Unified 6 Alameda 1 NA 99 622 572 11
4 Alameda City Unified 6 Alameda 1 NA 99 774 732 3
5 Alameda City Unified 6 Alameda 1 NA 99 811 784 1
6 Alameda City Unified 6 Alameda 1 NA 100 780 725 4
growth sch.wide comp.imp both awards meals ell yr.rnd mobility acs.k3
1 38 Yes Yes Yes Yes 14 16 <NA> 9 NA
2 33 Yes No No No 20 18 <NA> 13 NA
3 50 Yes Yes Yes Yes 55 25 <NA> 20 NA
4 42 Yes Yes Yes Yes 35 26 <NA> 21 20
5 27 Yes Yes Yes Yes 15 9 <NA> 11 20
6 55 Yes Yes Yes Yes 25 18 <NA> 12 20
acs.46 acs.core pct.resp not.hsg hsg some.col col.grad grad.sch avg.ed
1 NA 25 91 6 16 22 38 18 3.45
2 NA 27 84 11 20 29 31 9 3.06
3 26 27 86 11 31 30 20 8 2.82
4 30 NA 96 3 22 29 31 15 3.32
5 29 NA 96 3 9 29 26 33 3.76
6 29 NA 87 6 11 28 41 13 3.44
full emer enroll api.stu
1 85 16 1278 1090
2 90 10 1113 840
3 80 12 546 472
4 96 4 330 272
5 95 5 233 216
6 90 5 276 247
head(apisrs)
cds stype name sname
1039 15739081534155 H McFarland High McFarland High
1124 19642126066716 E Stowers (Cecil Stowers (Cecil B.) Elementary
2868 30664493030640 H Brea-Olinda Hig Brea-Olinda High
1273 19644516012744 E Alameda Element Alameda Elementary
4926 40688096043293 E Sunnyside Eleme Sunnyside Elementary
2463 19734456014278 E Los Molinos Ele Los Molinos Elementary
snum dname dnum cname cnum flag pcttest
1039 1039 McFarland Unified 432 Kern 14 NA 98
1124 1124 ABC Unified 1 Los Angeles 18 NA 100
2868 2868 Brea-Olinda Unified 79 Orange 29 NA 98
1273 1273 Downey Unified 187 Los Angeles 18 NA 99
4926 4926 San Luis Coastal Unified 640 San Luis Obispo 39 NA 99
2463 2463 Hacienda la Puente Unif 284 Los Angeles 18 NA 93
api00 api99 target growth sch.wide comp.imp both awards meals ell
1039 462 448 18 14 No Yes No No 44 31
1124 878 831 NA 47 Yes Yes Yes Yes 8 25
2868 734 742 3 -8 No No No No 10 10
1273 772 657 7 115 Yes Yes Yes Yes 70 25
4926 739 719 4 20 Yes Yes Yes Yes 43 12
2463 835 822 NA 13 Yes Yes Yes No 16 19
yr.rnd mobility acs.k3 acs.46 acs.core pct.resp not.hsg hsg some.col
1039 <NA> 6 NA NA 24 82 44 34 12
1124 <NA> 15 19 30 NA 97 4 10 23
2868 <NA> 7 NA NA 28 95 5 9 21
1273 <NA> 23 23 NA NA 100 37 40 14
4926 <NA> 12 20 29 NA 91 8 21 27
2463 <NA> 13 19 29 NA 71 1 8 20
col.grad grad.sch avg.ed full emer enroll api.stu pw fpc
1039 7 3 1.91 71 35 477 429 30.97 6194
1124 43 21 3.66 90 10 478 420 30.97 6194
2868 41 24 3.71 83 18 1410 1287 30.97 6194
1273 8 1 1.96 85 18 342 291 30.97 6194
4926 34 10 3.17 100 0 217 189 30.97 6194
2463 38 34 3.96 75 20 258 211 30.97 6194
For a full list of the variables available for each dataset, see ?survey::api (the double colon here denotes a namespace, and allows us to search the help for a package without needing to explicitly load it).
The simplest summary we can make is probably a basic contingency table. This is done using the table function. Tables of many dimensions can be created, but we’ll just consider a simple two-way table.
Specifically, let us use the apipop data to tabulate the number of schools by type and county. County is provided by the variable cname, while school type is provided by stype, a categorical variable with values E, M, and H (for elementary, middle, and high, presumably).
(t1 <- table(County=apipop$cname, "school type"=apipop$stype))[1:10,]
school type
County E H M
Alameda 196 31 52
Amador 6 2 2
Butte 30 9 9
Calaveras 7 1 2
Colusa 4 3 2
Contra Costa 120 26 33
Del Norte 6 1 1
El Dorado 26 5 9
Fresno 127 25 34
Glenn 5 2 2
Note that tables can participate in binary operations much like other array-like structures. For example, it is easy enough to create a similar table with row proportions instead:
(t1 <- with(apipop, table(cname, stype)))[1:10,]
stype
cname E H M
Alameda 196 31 52
Amador 6 2 2
Butte 30 9 9
Calaveras 7 1 2
Colusa 4 3 2
Contra Costa 120 26 33
Del Norte 6 1 1
El Dorado 26 5 9
Fresno 127 25 34
Glenn 5 2 2
(t2 <- with(apipop, table(cname)))[1:10]
cname
Alameda Amador Butte Calaveras Colusa
279 10 48 10 9
Contra Costa Del Norte El Dorado Fresno Glenn
179 8 40 186 9
(t1 / as.numeric(t2))[1:10,]
stype
cname E H M
Alameda 0.7025090 0.1111111 0.1863799
Amador 0.6000000 0.2000000 0.2000000
Butte 0.6250000 0.1875000 0.1875000
Calaveras 0.7000000 0.1000000 0.2000000
Colusa 0.4444444 0.3333333 0.2222222
Contra Costa 0.6703911 0.1452514 0.1843575
Del Norte 0.7500000 0.1250000 0.1250000
El Dorado 0.6500000 0.1250000 0.2250000
Fresno 0.6827957 0.1344086 0.1827957
Glenn 0.5555556 0.2222222 0.2222222
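The same row proportions can be obtained more directly with prop.table, which divides a table by its margin totals (this assumes the survey package and its api data are available, as above):

```r
data(api, package = "survey")          # requires the survey package
t1 <- with(apipop, table(cname, stype))
prop.table(t1, margin = 1)[1:10, ]     # margin = 1 gives row proportions
```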
More complicated summaries can be produced using either of the aggregate or by functions.
Let us start with a simple example: mean academic performance index (API) by district. The variables api99 and api00 provide the API, while district name is stored as dname. Consider the following:
with(apipop, by(cbind(api99, api00), stype, colMeans))
INDICES: E
api99 api00
633.1613 672.0627
--------------------------------------------------------
INDICES: H
api99 api00
621.0530 633.7947
--------------------------------------------------------
INDICES: M
api99 api00
634.5462 655.7230
The first argument is the thing being aggregated (in this case, both api99 and api00), the second argument is the by variable (or variables) as a list, and the third argument is the aggregating function, in this case colMeans (for column means, since the data has two columns). The result of by is essentially a vector when a single variable is aggregated, and a list equal in length to the number of groups otherwise.
The aggregate function is probably more useful in general. To make the equivalent set of aggregates with aggregate:
with(apipop, aggregate(cbind(api99, api00), list(stype=stype), mean))
stype api99 api00
1 E 633.1613 672.0627
2 H 621.0530 633.7947
3 M 634.5462 655.7230
The arguments are broadly the same as before, but we’ve used mean as the aggregating function. For the most part, any function that operates over a vector will work just fine. For example, we could just as easily have used length (to produce counts), median, sd (for standard deviation), sum, and so on.
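For instance, substituting length for mean yields group counts, and sd the group standard deviations (again assuming the survey package's api data is available):

```r
data(api, package = "survey")  # requires the survey package
with(apipop, aggregate(cbind(api99, api00), list(stype = stype), length))
with(apipop, aggregate(cbind(api99, api00), list(stype = stype), sd))
```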
To achieve the equivalent table for sample data we must take account of the survey weights. In the apisrs dataset, the appropriate weights are stored in the variable pw. For a simple sum, we would just multiply the variable(s) being aggregated by the survey weights. For a weighted mean, we have to be a little more clever (note that the trick below works because every school in this simple random sample has the same weight; with unequal weights, the weight totals within each group would be needed):
with(apisrs, aggregate(cbind(api99, api00)*pw/sum(pw)*nrow(apisrs),
list(stype=stype), mean))
stype api99 api00
1 E 627.3662 666.1408
2 H 592.6400 605.3600
3 M 637.4242 654.2727
An alternative format is to specify the table by way of a formula. In R, a formula is of the form:
lhs ~ rhs.1 + rhs.2 + ...
The same table as above specified via a formula is:
aggregate(cbind(api99, api00)~stype, apipop, mean)
stype api99 api00
1 E 633.1613 672.0627
2 H 621.0530 633.7947
3 M 634.5462 655.7230
The use of aggregate and by as described above does turn out to be a little cumbersome. We covered it, however, because it is part of the base language, and so is the traditional method. Code written with base functionality will always work, and doesn’t rely on certain packages being present.
However, going forward we recommend the use of the dplyr package instead. The syntax is more intuitive, and the performance is often far better than the base functions discussed (up to 40 times faster on my machine, for example). This package is very popular–so much so that if you were to distribute code that made use of the package, most would be able to run it without issue.
We won’t discuss the package in huge detail, but we will show how to reproduce the previous weighted mean example. See Introduction to dplyr for a reasonable introduction.
library(dplyr)
grp <- group_by(apisrs, stype)
summarise(grp, api99=sum(api99*pw)/sum(pw), api00=sum(api00*pw)/sum(pw))
# A tibble: 3 × 3
stype api99 api00
<fctr> <dbl> <dbl>
1 E 627.3662 666.1408
2 H 592.6400 605.3600
3 M 637.4242 654.2727
Alternatively, we can make use of the so-called pipe operator, %>%:
group_by(apisrs, stype) %>%
summarise(api99=sum(api99*pw)/sum(pw), api00=sum(api00*pw)/sum(pw))
# A tibble: 3 × 3
stype api99 api00
<fctr> <dbl> <dbl>
1 E 627.3662 666.1408
2 H 592.6400 605.3600
3 M 637.4242 654.2727
Using the pipe operator, the following:
bar(foo(x), y)
is rewritten:
foo(x) %>% bar(y)
The pipe operator is a convenience function that proves useful in situations where an expression is composed of several or more function calls. For example:
merge(d, merge(c, merge(a, b)))
could be rewritten:
merge(a, b) %>% merge(c) %>% merge(d)
which, I think, is far more readable.
The basic syntax for creating a function is:
foo <- function(signature){
<function body>
}
We’ve already seen many examples of function calls so far, but to call the function:
x <- foo(arguments)
You can do some relatively advanced things with functions in R. We’ll just cover the basics, but what we cover should suffice for the majority of users.
We’ll look at functions by way of example, starting very simply, and then gradually considering more advanced features as we go.
Let us start with one of the most basic functions we can think of, and something a little traditional:
hello <- function(){
cat("Hello, World!")
}
hello()
Hello, World!
Note that functions in R always return a value, even if that value is simply NULL. So:
x <- hello()
Hello, World!
x
NULL
Either way, let us add an explicit return value:
hello <- function(){
return("Hello, World!")
}
x <- hello()
x
[1] "Hello, World!"
For interactive use, R has a neat little function called invisible. This function simply returns a copy of an object that is temporarily invisible. To demonstrate:
foo <- function(x) return(invisible(x))
foo(1) # returns 1, but nothing prints on the console
x <- foo(1)
x
[1] 1
Why is this useful? Well, consider a function that plots something, but also returns some data. If we only care about the plot, then if the data is returned invisibly we can call the function without cluttering the console (and without doing an assignment). Actually, this is a common design pattern. Consider:
hist(mtcars$mpg) # plots, but nothing is printed to the console
h <- hist(mtcars$mpg, plot = FALSE)
h
$breaks
[1] 10 15 20 25 30 35
$counts
[1] 6 12 8 2 4
$density
[1] 0.0375 0.0750 0.0500 0.0125 0.0250
$mids
[1] 12.5 17.5 22.5 27.5 32.5
$xname
[1] "mtcars$mpg"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Now let us continue with our simple example, this time adding a single input parameter:
hello <- function(name){
return(paste0("Hello, ", name, "!"))
}
hello("Chris")
[1] "Hello, Chris!"
This function will fail if name isn’t provided. Similarly, we intend name to be a character string, but in reality the function will succeed for any type that can be coerced into a string. For example:
tryCatch(hello(), error=function(e) cat(e$message))
argument "name" is missing, with no default
hello(1) # works, but probably shouldn't
[1] "Hello, 1!"
So, let us modify the function as follows:
- if name is missing, print "Hello, World!"
- if name is not a character string, throw an error.

hello <- function(name){
if (missing(name)) return("Hello, World!")
if (!(class(name)=="character" & length(name)==1))
stop("Argument 'name' must be a single character string.")
return(paste0("Hello, ", name, "!"))
}
hello()
[1] "Hello, World!"
tryCatch(hello(1), error=function(e) cat(e$message))
Argument 'name' must be a single character string.
The point here is that it is often up to the programmer to ensure that functions behave correctly if the wrong inputs are provided. This is called defensive programming, and it is a good habit to form–especially if you intend your code to be used by others. Your code will be much easier to use if sensible errors/warnings are issued when it is used incorrectly.
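As a supplement (not part of the original example), base R's stopifnot function offers a compact way to express input checks of this kind. A minimal sketch, using a hypothetical hello2 variant:

```r
# A hypothetical variant of hello() using stopifnot() for the checks.
hello2 <- function(name){
  # stopifnot() throws an error naming the first assertion that fails
  stopifnot(is.character(name), length(name) == 1)
  paste0("Hello, ", name, "!")
}

hello2("Chris")
tryCatch(hello2(1), error = function(e) cat(e$message, "\n"))
```

The error messages stopifnot produces are less friendly than a hand-written stop message, but for quick internal checks it is hard to beat.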
Let us now consider an example which takes 2 arguments, x and y, and returns x^y:
power <- function(x, y) return(x^y)
While simple, this example lets us examine the idea of both positional and named parameters. The following calls will all calculate \(2^3=8\):
power(2,3)
power(x=2, y=3)
power(y=3, x=2)
power(y=3, 2)
For the first call, the arguments are matched by order. For the second and third, the arguments are matched by name. The last works because R can still match the formal arguments unambiguously–3 is named y so is matched accordingly; and 2 is matched with x since that is the only possibility remaining.
For readability, named parameters are probably best. That is, either of the 2nd or 3rd calls. Positional calls, such as the first call, are common, and probably acceptable if function signatures are simple enough. Calls like the last example will only lead to trouble, and should be avoided.
Functions can also be given default values. For example, let us repeat the power example above, but let y default to 1 if no value is provided:
power <- function(x, y=1) return(x^y)
power(2)
[1] 2
power(2,1)
[1] 2
...
While one can probably get by without it, one very useful construct in R is the ... argument, commonly called the ellipsis.
The ellipsis argument essentially allows us to write functions with an arbitrary number of parameters. Formally, the ellipsis argument is what is called a pairlist. To understand it, consider the following:
foo <- function(...){
return(list(...))
}
foo(a=1, b=2, 3)
$a
[1] 1
$b
[1] 2
[[3]]
[1] 3
The part ‘list(...)’ captures ... into a standard R list. In this case, the list contains named parameters a and b with values 1 and 2, respectively; and an unnamed parameter with the value 3.
As a further example of the use of ..., let us write our own version of the min function:
min(1, 2, 3)
[1] 1
Actually, the min function works in more situations than this, but we’ll restrict ourselves to this scenario. Our version:
mymin <- function(...){
args <- list(...)
if (length(args)==0) return(NA)
if (length(args)==1) return(args[[1]])
mn <- args[[1]]
for (i in 2:length(args)) if (args[[i]] < mn) mn <- args[[i]]
return(mn)
}
mymin(1, 2, 3)
[1] 1
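Another common use of ..., not shown above, is simply forwarding any extra arguments on to another function. A minimal sketch with a hypothetical greet function:

```r
# A hypothetical greet() that forwards any extra arguments to paste().
greet <- function(name, ...){
  paste("Hello,", name, ...)
}

greet("Chris")                    # "Hello, Chris"
greet("Chris", "and", "friends")  # "Hello, Chris and friends"
```

This pattern is common in practice–for example, many plotting wrappers accept ... and pass it through to plot so callers can still set col, lty, and so on.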
Variable scope is an important concern when writing functions. Consider the following variation on the hello example from earlier:
hello <- function(){
return(paste0("Hello, ", name, "!"))
}
name <- "Chris"
hello()
[1] "Hello, Chris!"
name <- "Christopher"
hello()
[1] "Hello, Christopher!"
name is not a formal parameter of the function hello, yet we can use it anyway. That is because name is defined in the environment which contains the function. When a variable is used, R will look in the current environment, then the parent environment, and so on, until a variable with a matching name is found. Now consider:
hello <- function(name){
return(paste0("Hello, ", name, "!"))
}
name <- "Chris"
hello("Christopher")
[1] "Hello, Christopher!"
name
[1] "Chris"
While there are certainly cases where this sort of thing is useful (Google function closures if you’re keen), it’s probably best to avoid it for now.
You’re far better off for now passing the name in as a formal argument. And, besides, this makes your code more reusable.
Finally, note that as a rule local copies of formal arguments are made if they are modified inside a function. For example:
foo <- function(x){
x <- 2 * x
return(x)
}
x <- 5
foo(x)
[1] 10
x
[1] 5
Or maybe even more obviously (if not more appropriately):
foo <- function(){
x <- 2 * x
return(x)
}
x <- 5
foo()
[1] 10
x
[1] 5
if, if else
The simplest conditional is an if statement. The basic syntax is:
if (condition) true.expression
or
if (condition){
true.codeblock
}
So, for example:
today <- "Saturday"
if (today %in% c("Saturday", "Sunday")) cat("I like weekends.")
I like weekends.
today <- "Monday"
if (today %in% c("Saturday", "Sunday")) cat("I like weekends.")
The basic syntax for an if-then-else construct in R is:
if (condition) true.expression else false.expression
or
if (condition){
true.codeblock
} else {
false.codeblock
}
Note that else must be on the same line as } when executing code interactively or an error will be thrown. If the if-then-else is inside a function then else can happily appear after a newline.
So, for example:
x <- runif(1)
y <- runif(1)
if (x < y) {
cat(sprintf("%.4f < %.4f", x, y))
} else {
cat(sprintf("%.4f > %.4f", x, y))
}
0.7237 > 0.4113
The if-then-else construct is so common that many languages have special syntax for it, often called the ternary operator (usually something like condition ? true.expression : false.expression). R has a similar feature provided by the ifelse(test, yes, no) function, where test is a condition resulting in TRUE or FALSE, and yes and no are expressions whose values are returned when test is TRUE or FALSE, respectively. Using this, the previous example can be rewritten:
cat(ifelse(x < y, sprintf("%.4f < %.4f", x, y), sprintf("%.4f > %.4f", x, y)))
0.7237 > 0.4113
Calls can be nested by recursively placing ifelse calls in place of each no expression, but this is at the expense of readability. This is almost never a worthwhile thing to do (and we only really even mention it as a possibility because heavy Excel users might be inclined to do this unless told otherwise).
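One point worth knowing, though: unlike a scalar ternary operator, ifelse is vectorised, which often removes the need for nesting altogether. A small sketch (the income figures are invented for illustration):

```r
# ifelse() is vectorised: test may be a logical vector, and the
# corresponding elements of yes/no are selected elementwise.
income <- c(25000, 45000, 90000)
band <- ifelse(income < 30000, "low", "other")
band  # "low" "other" "other"
```

So when recoding a whole column at once, a single vectorised ifelse is frequently simpler than looping over elements.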
The most generic form of the if-then-else construct is:
if (expression){
block
} else if (expression.2){
block.2
} else if (expression.3){
block.3
} ... else {
block.n
}
A common use for if-then-else constructs is recoding variables. So, let's take a look at this. Assume we have a dollar amount that we want to place in self-determined categories: 1, <$30000; 2, <$50000; and 3, ≥$50000. Then for any single amount we could do the following:
(x <- rnorm(1, 50000, 10000))
[1] 59189.77
if (x < 30000) 1 else if (x < 50000) 2 else 3
[1] 3
This is fine when there aren’t that many cases, and when the code blocks for each case are simple. But if this isn’t the case, or if we want to repeat the exercise a number of times, then we will generally find a function more useful:
recode <- function(x){
if (x < 30000){
return(1)
}
else if (x < 50000){
return(2)
}
else{
return(3)
}
}
recode(x)
[1] 3
Note that in the above example we could have laid out the if-then-else on one line like the example before. But we write it out this way to demonstrate how we would set things out if the code inside each condition was more complicated and spanned multiple lines.
Now, what if we want to do this for every item in a large vector? We could do this (we’ll see for loops next, but bear with me):
x <- rnorm(100, 50000, 10000)
y <- c()
for (i in x) y <- c(y, recode(i))
table(y)
y
2 3
50 50
Applying a function to every item in a collection, and collating the results is common. So common, in fact, that there is specific functionality provided to do this:
y <- sapply(x, recode)
table(y)
y
2 3
50 50
sapply is one of a number of functions in the so-called ‘apply’ family. Others are apply, lapply, mapply, and vapply. There is even a version of lapply, mclapply, that uses multiple CPUs to make code run faster.
The sapply example also demonstrates the use of a function as a function argument. In R, functions are first class objects, which means they can be used just like any other variable. This is very powerful, and has a number of useful consequences.
This is all a bit beyond the scope of this introductory course. Still, learning about the family of apply functions and functionals will certainly prove useful.
See: Advanced R by Hadley Wickham.
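For a small taste of the difference between the variants, here is a minimal, illustrative sketch:

```r
# lapply always returns a list; sapply simplifies the result where
# possible; vapply is like sapply but with a declared result type,
# which makes it safer for use inside other functions.
x <- c(4, 9, 16)
lapply(x, sqrt)               # a list of length 3
sapply(x, sqrt)               # the numeric vector 2 3 4
vapply(x, sqrt, numeric(1))   # the numeric vector 2 3 4, type-checked
```

In scripts where the shape of the result matters, vapply is generally preferred, since sapply can silently return a list if simplification fails.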
Returning briefly to the example, we note that the use of explicit loops in R is often much slower than alternatives such as sapply. Let’s look again, but time the results:
x <- rnorm(10000, 50000, 10000)
if (require(rbenchmark)){
benchmark(
loop1={
y <- c()
for (i in x) y <- c(y, recode(i))
},
loop2={
y <- vector("numeric", length(x))
for (i in 1:length(x)) y[i] <- recode(x[i])
},
sapply={
y <- sapply(x, recode)
},
replications=10,
columns=c('test', 'elapsed', 'relative')
)
}
test elapsed relative
1 loop1 3.854 20.720
2 loop2 0.386 2.075
3 sapply 0.186 1.000
So, it is worth being aware of the effect that different choices can have in practice. Here, three ways of achieving the same thing had quite different running times, and one of those choices was relatively poor.
Write a plot function that takes two parameters: x, a numeric vector; and difference, a boolean. When difference is TRUE, plot the difference of x, otherwise plot x. Do this with a single if statement. Hint: ?diff.
Write a plot function that takes two parameters: x, a numeric vector; and difference, a boolean. When difference is TRUE, plot the difference of x, otherwise plot x. Do this with an if-else.
for
The basic syntax for a for loop is:
for (item in collection){
statements
}
Basically, if x is an object in R for which length(x) is valid, then it can be used in a for loop, with item at each iteration being x[1], x[2], and so on.
Let’s consider a relatively simple example–just a running total. A common way to do this is as follows:
x <- rpois(10, 10)
sum <- 0
for (i in 1:length(x)) sum <- sum + x[i]
sum
[1] 111
Another way to do the same thing is:
sum <- 0
for (i in x) sum <- sum + i
sum
[1] 111
Both methods are perfectly acceptable, though the second is slightly more succinct.
However, in more complicated examples, it is often useful to keep track of the position in the sequence. This can be done by explicitly creating a counter:
i <- 1
for (item in collection){
statements
i <- i + 1
}
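A tidier alternative, worth knowing about, is seq_along, which iterates over positions directly without a hand-maintained counter. A minimal sketch:

```r
# seq_along(x) generates the positions 1, 2, ..., length(x), so we get
# the position at each iteration without maintaining a counter by hand.
x <- c("a", "b", "c")
for (i in seq_along(x)) cat(i, ":", x[i], "\n")

# Unlike 1:length(x), seq_along() is also safe when x is empty:
seq_along(character(0))  # integer(0), so a loop over it runs zero times
```

The 1:length(x) idiom used earlier works fine for non-empty vectors, but evaluates to c(1, 0) when x is empty, causing two unwanted iterations.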
It is common to nest for loops. While not very interesting from a practical point of view, let us demonstrate by summing all the elements in a matrix:
x <- matrix(rnorm(9), nrow=3)
sum <- 0
for (i in 1:3){
for (j in 1:3){
sum <- sum + x[i,j]
}
}
sum
[1] -2.182877
Consider a data frame, df, and the loop:
for (el in df){
cat(el, "\n")
}
What do you think will be printed at each iteration? That is, what do you think el is? Well, the internal representation of a data frame is a list in which each element is a column of the data frame. So at iteration i, el will be df[[i]], that is, column i of df.
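To see this concretely, here is a small sketch with an invented two-column data frame:

```r
# Each iteration of the loop receives one column of the data frame.
df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
for (el in df) print(class(el))  # "integer", then "character"
```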
while
The basic syntax for a while loop is:
while (condition){
statements
}
In this situation, the loop continues as long as condition evaluates to TRUE.
For example:
i <- 1
while(i<=5){
cat(i, "\n")
i <- i + 1
}
1
2
3
4
5
It is a relatively common error to write a while loop where the condition never evaluates to FALSE. So adding a counter with some sort of upper limit, as above, is often a good idea:
i <- 1
while (condition & i <= upper.limit){
statements
i <- i + 1
}
That said, a common cause of an infinite loop is simply forgetting to increment the counter i.
plot function
We’ve touched on it only very lightly, but objects in R are generally of a particular class. In addition, a number of built-in functions are so-called generic methods which behave differently depending on the input object’s class. plot is an example of a generic method:
plot
function (x, y, ...)
UseMethod("plot")
<bytecode: 0x55817d9fc118>
<environment: namespace:graphics>
When a generic method is called, the class of its first argument is checked, and if a version of that method specific to that class exists it is called. If no specific method exists, then some sort of default method will be called instead.
This probably isn’t clear at this point, so let’s consider a practical example. The built-in dataset AirPassengers is a time series object and has a class of ts. When plot is called on this object, R looks for a method called plot.ts and calls it if found:
plot(AirPassengers)
plot.ts(AirPassengers)
print is another generic function that we’ve already encountered, possibly without realising. print is called implicitly whenever we enter the name of an object at the interactive console, and what is printed is usually specific to the type of object.
In the case of plotting, the use of generic methods in this way is very useful. It means we can often call plot on an object and get something tailored for that type object. Moreover, the syntax for the various plot methods will generally be reasonably consistent, so we really only have to learn how to use a single command.
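The same dispatch mechanism is open to us as users. As an illustration (the class name here is purely hypothetical), we can define a print method for a class of our own:

```r
# Tag an object with a hypothetical S3 class, "temperature".
temp <- structure(21.5, class = "temperature")

# print() is generic: print(temp) dispatches to print.temperature.
print.temperature <- function(x, ...) {
  cat(unclass(x), "degrees Celsius\n")
}

print(temp)  # 21.5 degrees Celsius
```

At the interactive console, simply typing temp would trigger the same implicit print call, and hence the same method.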
The following shows the signature for plot with a range of more common parameters (there are many more options that can be tweaked–see ?par for more):
plot(x, y,
xlim, ylim, xaxt, yaxt, axes,
main, xlab, ylab, pch, col, lty, lwd)
A brief description of each is as follows:
- x is the object to plot, or data for the x-axis
- y is the data for the y-axis
- xlim is a custom range for the x-axis of form c(x1, x2)
- ylim is a custom range for the y-axis of form c(y1, y2)
- xaxt - set to "n" to suppress the x-axis
- yaxt - set to "n" to suppress the y-axis
- axes - set to FALSE to suppress both x and y-axes
- main is a plot title
- xlab is the title for the x-axis
- ylab is the title for the y-axis
- pch is the plot character
- col is the colour of lines, points, etc.
- lty is the line type
- lwd is the line width.

Here’s a few examples:
plot(AirPassengers,
main="Monthly Airline Passenger Numbers", lty=2, col="grey", lwd=2)
plot(AirPassengers, type='h')
plot(AirPassengers, type='p', pch=16)
plot(AirPassengers, type='b')
plot(1:20, rep(1,20), pch=1:20, xlab="", ylab="", axes=FALSE)
plot(1:12, rep(1,12), type="n", xlab="", ylab="", axes=FALSE)
for (i in 1:12) abline(v=i, lty=i)
plot(rnorm(10000,0,1), rnorm(10000,0,1), col=rgb(0,0,0,0.1), pch=16)
plot(table(rpois(1000, 5), rpois(1000, 3)))
You can make pretty much anything with basic plotting commands such as plot, lines, text, points, and polygon. But plenty of common plots have base implementations provided…
hist(mtcars$mpg, col="grey", border="white", xlab="MPG", main="")
boxplot(mtcars$mpg, notch=TRUE, horizontal=TRUE)
boxplot(mpg~cyl, data=mtcars, xlab="cylinders", ylab="MPG")
barplot(VADeaths)
barplot(VADeaths, beside=TRUE, col=rainbow(5))
pie(c(Sky=78, "Sunny side of pyramid"=17, "Shady side of pyramid"=5),
init.angle = 315,
col = c("deepskyblue", "yellow", "yellow3"),
border = FALSE)
plot(iris)
if (require(maps)) map('world', fill=TRUE)
ggplot2
This section is about base graphics. However, just as we demonstrated how to summarise a data frame using base R, but then advocated for the dplyr package instead, we do a similar thing here.
ggplot2 is a package which offers a completely different method for producing graphics. We aren’t necessarily suggesting it should be used in favour of base graphics, but in a number of situations it might be preferred.
For simple plots, base graphics are probably easier to use. But as plots get more complicated, or the amount of customisation increases (which happens, for example, when one attempts to apply Stats NZ graph standards), ggplot2 may end up being the more practical choice.
As for dplyr we will not dive too deeply here. We’ll just provide a single example, and leave interested readers to follow up themselves. See: docs.ggplot2.org.
As a rule, data will need to be made ‘skinny’ for use with ggplot. Using the dataset VADeaths which we used to demonstrate the barplot function:
library(reshape2)
(VA <- melt(VADeaths))
Var1 Var2 value
1 50-54 Rural Male 11.7
2 55-59 Rural Male 18.1
3 60-64 Rural Male 26.9
4 65-69 Rural Male 41.0
5 70-74 Rural Male 66.0
6 50-54 Rural Female 8.7
7 55-59 Rural Female 11.7
8 60-64 Rural Female 20.3
9 65-69 Rural Female 30.9
10 70-74 Rural Female 54.3
11 50-54 Urban Male 15.4
12 55-59 Urban Male 24.3
13 60-64 Urban Male 37.0
14 65-69 Urban Male 54.6
15 70-74 Urban Male 71.1
16 50-54 Urban Female 8.4
17 55-59 Urban Female 13.6
18 60-64 Urban Female 19.3
19 65-69 Urban Female 35.1
20 70-74 Urban Female 50.0
library(ggplot2)
ggplot(VA, aes(Var2, value)) +
geom_bar(aes(fill=Var1), stat="identity", position="dodge")
As we’ve seen, R has particularly strong graphical capabilities. At the same time, there are a vast number of freely available libraries out there for making compelling interactive content, including graphics and visualisations; and there’s no reason why we shouldn’t try and use these.
So, there are a number of R packages which simply provide a wrapper for third-party libraries and tools. One particularly common case is that of HTML and javascript. There are plenty of packages which just wrap common javascript functionality, providing functions which, essentially, just return correctly formed HTML which can be viewed as such.
We won’t get into it too deeply here. We’ll just present a few examples, but these examples should provide enough of an idea that users should feel comfortable enough exploring other similar packages. That said, readers are encouraged to check out the htmlWidgets gallery for other cool examples:
plotly
If you have some familiarity with ggplot2, one particularly easy way of making a plot interactive is to simply use the plotly::ggplotly function. Consider the following toy dataset:
df <- data.frame(
Z = rep(head(LETTERS, 5), times = 10),
X = as.numeric(rep(c(1:10), each = 5)),
Y = abs(rnorm(50))
)
head(df)
Z X Y
1 A 1 0.51595820
2 B 1 1.42194529
3 C 1 0.78200703
4 D 1 1.49504697
5 E 1 0.07996539
6 A 2 0.89793809
We could make a stacked area chart as follows:
library(ggplot2)
p <- ggplot(df, aes(x=X, y=Y, fill=Z)) + geom_area()
p
And it can be made interactive:
library(plotly)
ggplotly(p) #, width="100%")
This won’t always work perfectly or, in the case of complicated plots, even at all. But it’s easy and convenient all the same.
highcharter (highcharts)
The highcharter package provides a wrapper for the highcharts library. This is a particularly interesting case because, of course, the Stats NZ website currently makes use of highcharts in media releases and the like. For this reason, let’s pick a plot that appears on the website, and see how closely we can emulate it using nothing but the highcharter package. We choose the plot labelled ‘unemployment rate’ from Labour Market Statistics: June 2016 quarter as follows:
The data for this plot has been saved in this project as data/unemp.rda and can be loaded with the load function:
load("data/unemp.rda")
head(unemp)
ref label date val
2399 HLFQ.S1F3S Seasonally adjusted 2007-06-30 3.6
2400 HLFQ.S1F3S Seasonally adjusted 2007-09-30 3.5
2401 HLFQ.S1F3S Seasonally adjusted 2007-12-31 3.3
2402 HLFQ.S1F3S Seasonally adjusted 2008-03-31 3.8
2403 HLFQ.S1F3S Seasonally adjusted 2008-06-30 3.8
2404 HLFQ.S1F3S Seasonally adjusted 2008-09-30 4.0
A relatively vanilla highchart can be produced as follows:
library(highcharter)
h <- hchart(unemp, "line", x = date, y = val, group = label, width = "100%") %>%
hc_xAxis(title = list(text = NULL)) %>%
hc_yAxis(title = list(text = "(Percent)")) %>%
hc_title(text = "Unemployment rate") %>%
hc_subtitle(text = "Quarterly, June 2007\u20132016") %>%
hc_credits(enabled = TRUE) %>%
hc_exporting(enabled = TRUE)
h
This isn’t too far off, though we can get much closer (other than the \(x\)-axis, which the author can’t be bothered with) by applying a theme:
thm <- hc_theme(
colors = c("#003333","#669999"),
chart = list(
backgroundColor = "#FFFFFF",
zoomType = "xy"
),
xAxis = list(
lineColor = "#000000",
lineWidth = 1,
crosshair = TRUE,
labels = list(
style = list(
color = "#000000",
fontSize = "10px"
)
)
),
yAxis = list(
lineColor = "#000000",
lineWidth = 1,
crosshair = TRUE,
title = list(
rotation = 0,
align = "high",
offset = 0,
x = 45,
y = -10,
style = list(
color = "#000000"
)
),
labels = list(
style = list(
color = "#000000",
fontSize = "10px"
)
)
),
title = list(
align = "center",
style = list(
color = '#000000',
fontFamily = "Arial",
fontSize = "11px",
fontWeight = "bold"
)
),
subtitle = list(
align = "center",
margin = 50,
style = list(
color = '#000000',
fontFamily = "Arial",
fontSize = "11px",
paddingBottom = "15px"
)
),
legend = list(
itemStyle = list(
fontFamily = 'Arial',
color = '#3E576F',
fontWeight = 'normal'
),
itemHoverStyle = list(
color = 'gray'
)
),
credits = list(
enabled = TRUE,
text = "Source: Statistics New Zealand",
href = "http://www.stats.govt.nz"
)
)
h %>% hc_add_theme(thm)
leaflet
In much the same way as we did for the highcharter package, let’s find a map on the Stats NZ website and roughly emulate it in R using the leaflet package. We’ll use the 2013 Census population and dwelling Stats Map. Here’s a screenshot of this page, with Canterbury selected, and with a choropleth overlaid based on number of people:
StatsMaps
R is relatively capable when it comes to working with geospatial objects. The sp package, in particular, provides a number of data frame-like classes for representing spatial objects. In this case, we load a spatialPolygonsDataFrame which has been compressed and saved externally:
library(sp)
load("data/regc.shp.rda")
load("data/regc.tbl.rda")
rc2013 <- merge(regc.shp, regc.tbl[regc.tbl$year==2013,],
by=c("code", "label"))
plot(rc2013)
rc2013@data
code label year people occupied unoccupied
1 01 Northland Region 2013 151692 60192 13734
2 02 Auckland Region 2013 1415550 473448 33360
3 03 Waikato Region 2013 403638 152493 29514
4 04 Bay of Plenty Region 2013 267741 103500 13473
5 05 Gisborne Region 2013 43656 16185 1824
6 06 Hawke's Bay Region 2013 151179 58350 5925
7 07 Taranaki Region 2013 109608 43431 3855
8 08 Manawatu-Wanganui Region 2013 222672 87984 11523
9 09 Wellington Region 2013 471315 177813 15084
10 12 West Coast Region 2013 32148 13803 2796
11 13 Canterbury Region 2013 539436 208146 28320
12 14 Otago Region 2013 202467 80949 13317
13 15 Southland Region 2013 93339 38145 4683
14 16 Tasman Region 2013 47154 18885 2700
15 17 Nelson Region 2013 46437 18906 1317
16 18 Marlborough Region 2013 43416 18198 3960
We can then use the leaflet package to create an interactive map:
library(leaflet)
pal <- colorNumeric(
palette = "Greens",
domain = rc2013@data$people
)
ttip <- function(label, people, occupied, unoccupied){
sprintf("<table>
<tr><th colspan = 2>%s</th></tr>
<tr>
<td><b>People:</b></td>
<td style=\"text-align: right\">%s</td>
</tr>
<tr>
<td><b>Occupied dwellings:</b></td>
<td style=\"text-align: right\">%s</td>
</tr>
<tr>
<td><b>Unoccupied dwellings:</b></td>
<td style=\"text-align: right\">%s</td>
</tr>
</table>",
label,
format(people, big.mark = ','),
format(occupied, big.mark = ','),
format(unoccupied, big.mark = ','))
}
leaflet(width="100%") %>%
addTiles() %>%
addPolygons(data = rc2013, color = "#000000", weight = 1,
fillColor = pal(rc2013@data$people), fillOpacity = 0.8,
popup = with(rc2013@data,
ttip(label, people, occupied, unoccupied))) %>%
addLegend("bottomright", pal = pal,
values = rc2013@data$people,
title = "Number of people<br>(000)",
opacity = 1)
DT (Data Tables)
Data Tables are a very popular way of adding interactive tables to websites, and the DT package provides an excellent wrapper. We won’t dive too deeply into this beyond providing a simple example:
library(DT)
DT::datatable(iris, rownames = FALSE,
class = 'cell-border stripe',
colnames = gsub(".", " ", colnames(iris), fixed = TRUE))
Here we take a look at the New Zealand Income Survey Super SURF. A copy of the file has been placed in the ‘data’ folder of this project. When using the project, this file can be referred to in R as "data/NZIS_Super_SURF.txt".
To import the SURF, use the read.csv command:
nzis <- read.csv("data/NZIS_Super_SURF.txt")
This SURF is a relatively simple file, and so the default arguments are all fine. However, the read.csv function is somewhat flexible, and the function call can be modified somewhat if the imported dataset doesn’t look quite right.
read.csv actually just calls a more general function, read.table, with default parameters appropriate for CSV files. See ?read.table and ?read.csv for details.
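As an illustration of the kind of modification meant here (the file contents and column names are invented for the example), one might set the separator and header arguments explicitly:

```r
# Write a small semicolon-delimited file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;income", "1;10", "2;20"), tmp)

# read.table with explicit separator and header arguments; read.csv is
# essentially this with sep = "," pre-filled (and read.csv2 with sep = ";").
d <- read.table(tmp, header = TRUE, sep = ";", stringsAsFactors = FALSE)
d
```

Other commonly tweaked arguments include na.strings, colClasses, and skip; see ?read.table for the full list.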
RStudio has a nice built-in dataset viewer. Just click the nzis object in the Environment tab or call View(nzis) to view the nzis data frame.
The SURF is actually 100 different synthetic files containing 11315 observations each. The suggestion on the product page is that these repeated samples can be used to demonstrate sampling error. So, let’s have a go…
- Use dplyr::group_by and dplyr::summarise to calculate the mean of weekly_income for each replicate (i.e. for each unique surf_ID; hint: ?mean).
- Calculate the variance of the replicate means (hint: ?var). This is an estimate of the sampling variance.
- Calculate a 95% confidence interval (the mean plus or minus qnorm(0.975)=1.959964 times the standard error).
- Plot a histogram of the replicate means (hint: ?hist).
- Overlay a normal density curve (hint: ?curve).

Let us now just do a little exploratory analysis. Before getting started, though, let’s do a little data prep…
- The nzis dataset is effectively 100 datasets stacked on top of each other. Subset nzis to a single dataset, and store as nzis.sub.
- Ensure age_category is categorical (hint: ?factor).
- Ensure the sex variable is categorical, and assign levels such that 1 is male and 0 is female.
- Ensure the ethnicity variable is categorical, and assign levels as follows: 1=European, 2=Māori, 3=Pacific, 4=Other, 5=Māori combination, and 6=non-Māori combination.
- Ensure the highest_qualification variable is categorical, and assign levels as follows: 1=no qualification, 2=school, 3=vocational / trade, 4=bachelor or higher, 5=other post-school.

Now, let us look at how income differs by group. First, informally…
- Plot weekly_income against weekly_hours–note that both variables are continuous. Add / remove / change options to make the plot more attractive if you want.
- Plot weekly_income against highest_qualification–note that highest_qualification is categorical (so a simple scatter probably isn’t the best plot choice).
- Plot weekly_income against both sex and ethnicity.

Now let’s look at how income differs by group a little more formally…
- Fit a model with weekly_income as the dependent variable, and all remaining variables except random_ID and surf_ID as explanatory variables (hint: ?lm). Store the fit as fit.1.
- Inspect the fit (hint: ?summary).
- Call plot on fit.1 and see what happens.

Shiny applications are interactive web applications which are authored via the shiny package.
Essentially they consist of two components: a ui component, and a server component. The ui component contains the application layout, typically including any interactive controls (I say typically, because controls can be created dynamically or changed once an app is running, and this would be controlled via the server component). The server component contains all of the logic required to drive the app.
The following is a complete working application which you can start from here:
library(shiny)
ui <- basicPage(
numericInput("n", "number of observations:",
value=30, width = "100%"),
plotOutput("distplot")
)
server <- function(input, output){
output$distplot <- renderPlot({
x <- rnorm(input$n)
hist(x, col="grey", border='white',
freq=FALSE, xlab="X", main="")
})
}
shinyApp(ui = ui, server = server)
And, in fact, we can even place this in a function (useful for putting in packages, for example):
myfirstapp <- function(){
if (!require(shiny))
stop("Please install the shiny package.")
ui <- basicPage(
numericInput("n", "number of observations:",
value=30, width = "100%"),
plotOutput("distplot")
)
server <- function(input, output){
output$distplot <- renderPlot({
x <- rnorm(input$n)
hist(x, col="grey", border='white',
freq=FALSE, xlab="X", main="")
})
}
shinyApp(ui = ui, server = server)
}
myfirstapp()
This is the basic setup we will continue with, though others are possible. For example, say the app was called foo. It is common, and recommended for large apps, to place the ui component in a file called foo/ui.R, and the server component in a file called foo/server.R, and to launch the app by calling:
shiny::runApp('path-to-foo/foo')
Either way, let us now dissect the example app a little. The ui component is basically just a block of HTML which will be inserted into the running web page. In this case, it consists of a call to basicPage, which is the simplest possible layout. Layout functions such as this take a comma separated list of individual layout elements–in this case, a numeric input and a place holder for an R plot.
The numeric input has an ID of "n", and this is how we are able to access its contents in the server component. Specifically, we can refer to it in code as input$n. The second argument is a label, and is the text that appears above the control when the app is running–it can be set to NULL if desired. The third argument is the default value the numeric input will have–in this case 30. Finally, the last argument controls how wide the control will be rendered–100% of the parent container in this case.
The plot output takes just a single argument in this case–the ID. This is accessed a bit differently than the numeric input. In the case of the numeric input we wish to consume the value it contains. In the case of the plot output, we wish to update it. So, we do this by rendering a plot using the renderPlot function, and assigning it to output$distplot, where "distplot" here is the plot’s ID.
The plotOutput and renderPlot pattern here is quite common in shiny apps. That is, for some kind of object x, you will typically then put xOutput in the ui, and renderX in the server. We’ll see a few more examples of this soon.
Finally, we note that the plot updates whenever the value in the numeric input is changed. Functions like renderPlot are reactive, and simply by referring to an input, input$n in this case, they take a dependence on that input.
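To illustrate the xOutput / renderX pairing with a couple of other object types (a sketch only–this small app is not part of the running example), tables and text follow the same pattern:

```r
library(shiny)

ui <- basicPage(
  tableOutput("summary"),  # paired with renderTable in the server
  textOutput("caption")    # paired with renderText in the server
)

server <- function(input, output) {
  output$summary <- renderTable(head(mtcars))
  output$caption <- renderText("The first few rows of mtcars")
}

app <- shinyApp(ui = ui, server = server)
```

Here tableOutput / renderTable and textOutput / renderText play exactly the same roles as plotOutput / renderPlot do in the example app.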
Anyway, let’s gradually ramp things up. First, let’s fancy up the ui a bit simply by using an alternative layout to basicPage:
library(shiny)
ui <- fluidPage(
  titlePanel("Normal Distribution"),
  sidebarLayout(
    sidebarPanel(
      numericInput("n", "number of observations:",
                   value = 30, width = "100%")
    ),
    mainPanel(
      plotOutput("distplot")
    )
  )
)
server <- function(input, output) {
  output$distplot <- renderPlot({
    x <- rnorm(input$n)
    hist(x, col = "grey", border = "white",
         freq = FALSE, xlab = "X", main = "")
  })
}
shinyApp(ui = ui, server = server)
fluidPage here produces a layout with an overall title, a sidebar, and main content: titlePanel supplies the title, while sidebarLayout contains a sidebarPanel and a mainPanel. Other layouts are possible, and the shiny application layout guide gives a reasonably good overview.
Given a working understanding of R, shiny applications are relatively simple, and are a great way of sharing functionality with non-programmers. That said, shiny also supports some relatively complicated things, and it is certainly possible to write big, complicated apps. We will introduce a couple of relatively complicated ideas related to shiny’s reactive model here. These are ideas which will serve you well, but they may take a bit more head scratching than what we’ve seen so far. So don’t worry if things aren’t immediately clear.
One consequence of shiny’s reactive model is that a block of code that takes a dependence on some dynamic input, such as the renderPlot call above, will be re-executed whenever that input changes. Sometimes, this is not what we want. For example, say there is a block of code that is very CPU intensive and depends on several inputs. We would probably want to delay execution of the code until we’d finished updating all the inputs. Well, we can do that by isolating the inputs. Consider the following modification to the above app:
library(shiny)
ui <- fluidPage(
  titlePanel("Normal Distribution"),
  sidebarLayout(
    sidebarPanel(
      numericInput("n", "number of observations:",
                   value = 30, width = "100%")
    ),
    mainPanel(
      plotOutput("distplot")
    )
  )
)
server <- function(input, output) {
  output$distplot <- renderPlot({
    n <- isolate(input$n)
    x <- rnorm(n)
    hist(x, col = "grey", border = "white",
         freq = FALSE, xlab = "X", main = "")
  })
}
shinyApp(ui = ui, server = server)
By wrapping input$n in isolate, we have prevented the renderPlot call from taking a dependence on it, and so now nothing happens at all when input$n is changed. That’s clearly not what we want. So, let us add a button which we can click to force the plot to redraw:
library(shiny)
ui <- fluidPage(
  titlePanel("Normal Distribution"),
  sidebarLayout(
    sidebarPanel(
      numericInput("n", "number of observations:",
                   value = 30, width = "100%"),
      actionButton("go", "Go", icon = icon("refresh"))
    ),
    mainPanel(
      plotOutput("distplot")
    )
  )
)
server <- function(input, output) {
  output$distplot <- renderPlot({
    input$go
    n <- isolate(input$n)
    x <- rnorm(n)
    hist(x, col = "grey", border = "white",
         freq = FALSE, xlab = "X", main = "")
  })
}
shinyApp(ui = ui, server = server)
The value of input$go in this case is simply the number of times the go button has been clicked, and the renderPlot block takes a dependence on the button simply by referring to it.
The next idea we’ll consider is largely around good practice. Notice that in the renderPlot block we both create the data to plot and make the plot itself. I would argue that these tasks should be separated, and that renderPlot should be concerned exclusively with constructing a plot, as far as possible. This might not seem obvious at this point, but imagine the data to be plotted was created by a very complicated process, possibly involving the extraction of data from a database and / or a large transformation of some kind. Well, consider the following modification:
library(shiny)
ui <- fluidPage(
  titlePanel("Normal Distribution"),
  sidebarLayout(
    sidebarPanel(
      numericInput("n", "number of observations:",
                   value = 30, width = "100%"),
      actionButton("go", "Go", icon = icon("refresh"))
    ),
    mainPanel(
      plotOutput("distplot")
    )
  )
)
server <- function(input, output) {
  plotdata <- reactive({
    input$go
    n <- isolate(input$n)
    rnorm(n)
  })
  output$distplot <- renderPlot({
    hist(plotdata(), col = "grey", border = "white",
         freq = FALSE, xlab = "X", main = "")
  })
}
shinyApp(ui = ui, server = server)
Now the renderPlot block does nothing but draw a histogram. The object to be plotted is accessed by calling plotdata(), which is the result of a reactive expression.
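As an aside, shiny also provides eventReactive, which packages up the button-plus-isolate pattern: the expression re-executes only when the named event fires. A sketch of an equivalent server (note that, unlike the isolate version, nothing is drawn until the button is first clicked–see the ignoreNULL argument of eventReactive):

```r
library(shiny)

server <- function(input, output) {
  # Re-sample only when the "go" button is clicked; input$n is not a dependence.
  plotdata <- eventReactive(input$go, {
    rnorm(input$n)
  })
  output$distplot <- renderPlot({
    hist(plotdata(), col = "grey", border = "white",
         freq = FALSE, xlab = "X", main = "")
  })
}
```

Which of the two styles to prefer is largely a matter of taste, though eventReactive makes the intended trigger explicit.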
As an exercise, modify the app so that the user can choose which distribution to sample from (see ?selectInput, ?rnorm, ?runif, and ?rexp). Note that the distributions take different parameters: rnorm defaults to the standard normal distribution, but values of \(\mu\) and \(\sigma\) other than 0 and 1 are possible; rt requires a degrees of freedom parameter to be present, and so on. Hint: ?uiOutput.
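One way to start on the exercise (a sketch only–sample_dist is a hypothetical helper, not part of shiny) is to map the selected choice to the corresponding sampling function:

```r
# Hypothetical helper mapping a selectInput choice to a sampling function.
sample_dist <- function(dist = c("normal", "uniform", "exponential"), n = 30) {
  dist <- match.arg(dist)
  switch(dist,
         normal      = rnorm(n),   # mean = 0, sd = 1 by default
         uniform     = runif(n),   # on [0, 1] by default
         exponential = rexp(n))    # rate = 1 by default
}

x <- sample_dist("uniform", n = 100)
```

In the app, dist would come from the value of a selectInput, and distributions requiring extra parameters (such as rt and its degrees of freedom) could expose additional inputs, generated dynamically via uiOutput and renderUI.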